Predictive coding is hard to master, not very flexible and quite limited in its uses. That’s the perception, said Hal Marcus. Only it’s wrong.

Marcus, an e-discovery attorney and director of product marketing at OpenText Solutions, was the moderator of a webinar on predictive coding that we hosted on June 1. (To view it, go to the webinar registration page, fill out the form and then click on “Register.”) The title of his first slide framed the discussion: The Perception Problem. Beneath it was a quote from an Ari Kaplan survey: “There is so much promise associated with predictive coding, but it is a heavy lift and not very flexible.”

“I find this hard to reconcile,” said Marcus. His company has lots of clients that use predictive coding on a wide range of matters, and the uses keep growing. It’s used for investigations, including M&A due diligence, he said. It can assist in constructing a data breach response. For the standard document reviews for which it was first used, it can help prioritize which documents ought to be looked at first. And yet, he said, there’s a “strong perception that it’s for one case out of 20.”

The puzzle he asked four panelists to help him solve was: What is behind this misimpression? They came up with six statements about predictive coding that seem to be widely believed. One by one, they knocked them down.

The first perception is that predictive coding depends on perfect “seed” sets or control sets created by subject matter experts. Not true, said panelist Ethan Ackerman, an associate at Morgan, Lewis & Bockius. “That’s a myth,” he said. “It works with any documents you give it.” And it doesn’t matter who chooses them. The procedure is simple: It looks for “more like this.”

The second perception is that you need to have all the data up front before you proceed to use the software. “That’s one we hear quite a lot,” noted Marcus. For this one, he turned to his OpenText colleague Alexis Mitchell, the company’s principal data scientist and workflow consultant. “When we’re talking about continuous machine learning,” Mitchell said, “the system is learning continuously through the entire review project.” You can add new data at any time, and it’s just “folded in.”

This ability, said Ackerman, is an indication of how robust the software really is. When new information rolls in, “you don’t lose the benefit of the learning,” he observed. Rather, “you extend the knowledge to new data.”
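The “more like this” behavior Ackerman describes, and the ability to fold in new documents at any point, can be illustrated with a toy scorer. This is an invented, pure-Python sketch for intuition only; commercial predictive-coding engines use far more sophisticated models than a simple word-overlap score.

```python
from collections import Counter
import math

# Toy "more like this" scorer (illustrative only, not a real engine).
# It keeps a running word profile of documents coded relevant, and new
# training documents can be folded in at any time -- mimicking the
# continuous-learning behavior described in the webinar.

class MoreLikeThis:
    def __init__(self):
        self.profile = Counter()

    def learn(self, relevant_doc: str) -> None:
        """Fold a newly coded relevant document into the profile."""
        self.profile.update(relevant_doc.lower().split())

    def score(self, doc: str) -> float:
        """Cosine-style similarity between a document and the profile."""
        words = Counter(doc.lower().split())
        dot = sum(count * self.profile[w] for w, count in words.items())
        norm = (math.sqrt(sum(v * v for v in words.values()))
                * math.sqrt(sum(v * v for v in self.profile.values())))
        return dot / norm if norm else 0.0

model = MoreLikeThis()
model.learn("merger agreement due diligence")
model.learn("merger closing conditions")   # new data, simply folded in

# Rank unreviewed documents by similarity to what was coded relevant.
ranked = sorted(["merger agreement draft", "lunch menu"],
                key=model.score, reverse=True)
```

Adding a document via `learn` never discards the earlier profile, which is the point Ackerman makes: new data extends the learning rather than resetting it.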

The next issue they dealt with was how much math training you need to use the software. Some of what’s been written describes the software as a very complicated model “that requires really strong skills,” Marcus said. The reality, he added, is that “the skill level you need is nowhere near what some people expect.”

Dawson Horn, associate general counsel at AIG, confirmed that. “It’s not like you have to have a Ph.D. in statistics to figure it out,” he said. All the math skills are geared to help a user assess whether the program is accurately identifying documents similar to the ones you tell it are relevant. It helps, Horn added, if you estimate in advance the total number of responsive documents you expect, and then set that as your target. Once you hit that number, you should have arrived at a point of diminishing returns.

Two factors may work against each other, he continued. The more comprehensive the review, the less efficient and precise it’s likely to be. That’s because a comprehensive review will necessarily pull in more irrelevant documents. Conversely, a highly efficient review may sacrifice breadth. But keep in mind, Horn added, “the review does not have to be perfect. It merely has to be reasonable.”
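The tension Horn describes is the familiar tradeoff between recall (comprehensiveness) and precision (efficiency). A minimal sketch with made-up review counts:

```python
# Recall vs. precision tradeoff in document review.
# All counts below are hypothetical, for illustration only.

def recall(found_relevant: int, total_relevant: int) -> float:
    """Share of all relevant documents the review actually found."""
    return found_relevant / total_relevant

def precision(found_relevant: int, total_reviewed: int) -> float:
    """Share of reviewed documents that turned out to be relevant."""
    return found_relevant / total_reviewed

# A broad review: reads 50,000 docs to find 9,000 of 10,000 relevant ones.
broad_recall = recall(9_000, 10_000)        # comprehensive
broad_precision = precision(9_000, 50_000)  # but not efficient

# A narrow review: reads 8,000 docs to find 6,000 of the same 10,000.
narrow_recall = recall(6_000, 10_000)       # less comprehensive
narrow_precision = precision(6_000, 8_000)  # but far more efficient
```

Pushing recall toward 100 percent drags precision down, and vice versa, which is why Horn frames the standard as reasonableness rather than perfection.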

One more point. In most instances, predictive coding is not the only analytic tool you will be using to analyze the data, Horn noted. You can always check your results, and tinker with the data, using those other tools.

Another common belief is that predictive coding is a blunt, binary instrument that has limited capabilities. Not true, countered Ackerman. You can get it to do many tasks, once you know how to frame the questions. “If you can characterize a document,” he said, “you can predictive code on that aspect.” Two of his personal favorites: “Contains good jokes. Contains bad jokes.”

Another aspect that’s likely to be a tad more important to lawyers is “privileged.” He described this one as “the Holy Grail of document review in discovery.” And yes, Ackerman said, you can use this software to identify privileged documents.

The next “myth” is the notion that you can only use this technique on very large data sets. Conventional wisdom suggests that you need at least 100,000 documents to use predictive coding, Ackerman said. But that’s another false assumption. He described cases that were successfully concluded using only a fifth of that number.

The final misconception involves discovery production. “There’s the perception that this is going to be complicated,” Marcus said. The expectation is that it will involve “a lot of disclosure.” There will be “negotiations with the other side,” and it will require approval, “possibly from the judge.” Often, in anticipation of these challenges, attorneys say, “I don’t really want to get into the hassle of that.”

Kiriaki Touriskis, assistant general counsel at JPMorgan Chase, took this one on. “From an in-house perspective,” she said, “for me this is probably the most frustrating barrier.” Her personal opinion is that there’s no legal obligation to disclose. The answer could depend on the type of matter that’s under review. If it’s a regulatory matter, then more disclosure may be necessary, she acknowledged. If it’s a civil matter, it would depend on the judge and the opponents. It’s a conversation to have in the early stages of litigation, she suggested.

There’s also confusion about what needs to be disclosed, when disclosure is required. It’s not about “explaining the black box.” It’s about the statistics, who conducted the review, how many documents you expected to find and how many you actually did. “These are questions a good lawyer, a good project manager should be asking in the course of a linear review,” she said. And all you really need to explain is the workflow.

Marcus added a coda to wrap up the webinar. “At the end of the day, it comes down to defending your decision to stop reviewing,” he said. “And it’s validation that will give you what you need.”

It all comes down to proportionality, Marcus emphasized. No one is expected to review every document. That’s not the standard, and it never has been. You just have to show what you’ve done and demonstrate that you validated that it’s complete.
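The validation Marcus refers to is commonly done by sampling the documents the system would leave unreviewed. A simplified, elusion-style sketch, with entirely hypothetical numbers:

```python
# Elusion-style validation sketch (all numbers hypothetical):
# sample the "discard pile" the model says you can skip, have humans
# review the sample, and estimate how many relevant docs remain.

discard_pile_size = 40_000   # documents predicted non-relevant
sample_size = 500            # randomly drawn and reviewed by humans
relevant_in_sample = 9       # relevant docs reviewers found in the sample

elusion_rate = relevant_in_sample / sample_size       # 1.8%
estimated_missed = elusion_rate * discard_pile_size   # roughly 720 docs

# If the review already found 10,000 relevant documents, estimated
# recall is 10,000 / (10,000 + missed) -- evidence for stopping.
found_relevant = 10_000
estimated_recall = found_relevant / (found_relevant + estimated_missed)
```

A defensibly low elusion rate, or a high estimated recall, is the kind of statistic Touriskis says parties actually end up discussing, rather than the inner workings of the model.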

About the author

Metropolitan Corporate Counsel
