Internal Investigations: 10 Ways To Be A Cyber Sleuth

By Caroline Sweeney

(The following article was first published on August 15, 2017, on Law360, written by Caroline Sweeney, the global director for e-discovery and client technology at Dorsey & Whitney LLP and a faculty member for the Compliance, Governance and Oversight Council (CGOC).)

Law360, New York (August 15, 2017, 1:00 PM EDT) —

Cyberattacks aren’t the only significant threats facing enterprises today. Companies often find themselves needing to conduct extensive and costly investigations into employee behavior. For example, I was recently involved in an internal investigation that was estimated to cost a global Fortune 500 company more than $1 million just for the investigation itself. Of course, costs can run much higher when settlements and other legal fees are included. According to the Mintz Group, Foreign Corrupt Practices Act penalty amounts totaled $1.8 billion for the period from implementation of the FCPA in 1977 to May 2016. The financial impact to companies, as well as the damage to their reputations and business disruption, can be staggering, and possible litigation following an investigation can cause further financial and reputational harm.

Internal investigations are launched in response to a variety of situations. In an employment context, if there is a suspicion that an employee who left a company has taken proprietary information, internal investigators must determine what might have been taken and how. When an employee alerts authorities or corporate executives to potential wrongdoing by other employees or the organization, investigators must immediately research the allegations. When a government agency or foreign investigative body (such as the U.S. Department of Justice, the U.S. Securities and Exchange Commission, the Consumer Financial Protection Bureau, the U.K.’s Serious Fraud Office, the Australian Securities and Investments Commission, or the newly formed Agence Française Anti-corruption) launches an investigation into financial fraud, money laundering, illegal payments in exchange for preferential treatment, or other potential wrongdoing, investigators need to conduct a methodical investigation and be able to defend their processes and findings to the regulatory agencies. Further, because investigations can lead to litigation, a defensible investigative process is critical to an effective e-discovery strategy in the future.

Investigators in all these situations face two common challenges. First, finding answers often means analyzing mountains of data. Second, the answers must be found quickly. Investigative speed has the potential not only to limit the damage caused by the original transgression, but also to dramatically reduce legal costs and manage corporate risk.

Given the overwhelming complexity of today’s data environments, meeting these challenges depends on investigators taking advantage of a variety of technology tools to apply both tried-and-true and new analytical techniques to data investigations. Investigators must also develop the mindset of a relentless and detailed forensic detective. With that in mind, here are the top 10 best practices for becoming a cyber sleuth.

1. Concept Analysis and Vetting of Key Terms

In traditional e-discovery, the goal is typically to reduce the data population for review by applying keywords, which are negotiated with the opposing party. In an internal investigation, the goal is to be more flexible and open-minded. Early in an investigation, investigators may not know what the most effective keywords are. By spending some time analyzing the concepts (nouns and noun phrases) in your collection, you can ensure a more strategic approach to determining potential search terms to narrow the focus of your investigation. This concept list can be helpful in vetting terminology, identifying search terms, and suggesting synonyms you had not considered.

For example, concepts — such as “profiting,” “exchanging” or “time sensitivity” — can guide the development of specific search terms and point you to helpful documents. It is also important to identify words that seem odd but that appear regularly — such as “my grandmother” or “garden” in an otherwise dry business email — which may indicate a code word. And, of course, if you see inappropriate language or terms such as “write-off” you may want to look closely at those documents to see exactly what was being discussed.

If a predetermined keyword list does exist, you can identify which terms are returning responsive documents by reviewing a random sample of documents hitting on those search terms and then analyzing the results to identify search terms that are returning primarily responsive documents. As discussed below, those responsive documents can be compared to the unreviewed population to identify other documents that are conceptually similar to the known responsive documents. This is a great way to help you quickly identify key documents in the population.
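The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any review platform: the function names (`sample_hits`, `responsiveness_rates`), the document IDs, and the reviewer calls are all hypothetical.

```python
import random

def sample_hits(hits_by_term, sample_size, seed=0):
    """Draw a random sample of document IDs for each search term."""
    rng = random.Random(seed)
    return {
        term: rng.sample(sorted(doc_ids), min(sample_size, len(doc_ids)))
        for term, doc_ids in hits_by_term.items()
    }

def responsiveness_rates(sampled, coding):
    """Fraction of each term's sampled documents that reviewers coded responsive."""
    rates = {}
    for term, doc_ids in sampled.items():
        reviewed = [d for d in doc_ids if d in coding]
        if reviewed:
            rates[term] = sum(coding[d] for d in reviewed) / len(reviewed)
    return rates

# Hypothetical data: document IDs hit by each term, and reviewer calls
# (True means the reviewer coded the document responsive).
hits_by_term = {"write-off": {1, 2, 3, 4}, "garden": {5, 6, 7, 8}}
coding = {1: True, 2: True, 3: True, 4: False,
          5: False, 6: False, 7: True, 8: False}

sampled = sample_hits(hits_by_term, sample_size=4)
rates = responsiveness_rates(sampled, coding)
for term, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {rate:.0%} responsive in sample")
```

Terms with a high rate in the sample are candidates to keep, while terms with a low rate may need to be narrowed or dropped.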

2. Timeline Gap Analysis

In traditional e-discovery, timeline gap analysis is used to confirm that relevant email and documents from specific individuals are not missing from specific time frames. Similarly, in an investigative context, timeline gap analysis can be used to ensure that all the documents and emails from a particular custodian have been accounted for and collected.

The ability to identify spikes and lulls in communication is important. A lull in email communication could indicate that another channel of communication was being used and should be investigated. Or, it could mean documents or communications were deleted because they contained content that implicated wrongdoing. A timeline gap analysis may also indicate that communications on a particular topic were occurring sooner or later than originally thought.

With the huge amount of data that investigators must wade through, only technology is really capable of ensuring that every type of relevant information (email, documents, text messages, calendar entries, note-taking entries, etc.) is being evaluated in order to build a full picture of what was occurring during a particular time frame.
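As a rough illustration of how lulls in a custodian's communications might be detected, the sketch below counts messages per day and reports stretches with no messages at all. The function names, the three-day threshold, and the dates are assumptions made for the example. A real analysis would also need to account for weekends, holidays, and leave.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(sent_dates, start, end):
    """Number of messages on every day from start to end, inclusive."""
    counts = Counter(sent_dates)
    days = (end - start).days + 1
    return {start + timedelta(days=i): counts.get(start + timedelta(days=i), 0)
            for i in range(days)}

def find_lulls(counts, min_gap_days=3):
    """Return (first_day, last_day) ranges with no messages for at least min_gap_days."""
    lulls, run_start, last_day = [], None, None
    for day in sorted(counts):
        if counts[day] == 0:
            if run_start is None:
                run_start = day
        else:
            if run_start is not None and (day - run_start).days >= min_gap_days:
                lulls.append((run_start, day - timedelta(days=1)))
            run_start = None
        last_day = day
    # A lull may still be open when the date range ends.
    if run_start is not None and (last_day - run_start).days + 1 >= min_gap_days:
        lulls.append((run_start, last_day))
    return lulls

# Hypothetical sent dates for one custodian.
sent = [date(2017, 1, 2), date(2017, 1, 3), date(2017, 1, 9), date(2017, 1, 10)]
counts = daily_counts(sent, date(2017, 1, 1), date(2017, 1, 10))
lulls = find_lulls(counts)
for first, last in lulls:
    print(f"No messages from {first} to {last}")
```

The same per-day counts can be used to spot spikes, for example by comparing each day with the custodian's average volume.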

3. Communications Analysis

Communications analysis is helpful on multiple levels. First, it identifies who was communicating with whom, that is, which custodians had direct communications and which may have been passing information on to, or through, other parties. This is critical to identifying individuals who may not have been on the radar at the beginning of the investigation.

In many cases, for example in trade secrets investigations, it is imperative to identify which custodians are forwarding information to personal email accounts. This may simplify the process of homing in on relevant documents.

Communications analysis is also important for identifying unusual channels of communication. For example, by analyzing the domain names of email addresses, we once figured out that fantasy sports communications (which usually appear to be mere entertainment or spam and are therefore often ignored in e-discovery) were being used by employees to hide inappropriate behavior.
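A domain-level view of the kind described above can be approximated with Python's standard library. The sketch below is illustrative only: the message records, the `PERSONAL_DOMAINS` list, and the function names are hypothetical, and a real matter would work from the metadata fields exported by the processing platform.

```python
from collections import Counter
from email.utils import getaddresses

# Assumed list of webmail domains; a real list would be far longer.
PERSONAL_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com"}

def recipient_addresses(msg):
    """Parse the 'to' header into a list of bare email addresses."""
    return [addr for _name, addr in getaddresses([msg["to"]]) if "@" in addr]

def domain_of(address):
    return address.rsplit("@", 1)[-1].lower()

def recipient_domains(messages):
    """Count how often each recipient domain appears across all messages."""
    counts = Counter()
    for msg in messages:
        for addr in recipient_addresses(msg):
            counts[domain_of(addr)] += 1
    return counts

def sent_to_personal(messages):
    """IDs of messages with at least one recipient at a personal webmail domain."""
    return [msg["id"] for msg in messages
            if any(domain_of(a) in PERSONAL_DOMAINS for a in recipient_addresses(msg))]

# Hypothetical message metadata.
messages = [
    {"id": "m1", "to": "Bob <bob@corp.example>, ann.personal@gmail.com"},
    {"id": "m2", "to": "league@fantasy-sports.example"},
    {"id": "m3", "to": "bob@corp.example"},
]

domains = recipient_domains(messages)
flagged = sent_to_personal(messages)
print(domains.most_common())
print(flagged)
```

Sorting the domain counts from rarest to most common is one simple way to surface channels that would otherwise be dismissed as noise.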

4. Continuous Active Learning

Traditional predictive coding has been accepted by the courts for e-discovery but often involves so much negotiation that adoption is not as widespread in litigation as it could be. In the investigative context, however, we have had great success with continuous active learning (CAL), which is often referred to as predictive coding 2.0 or technology-assisted review (TAR) 2.0. CAL technology is an important way to home in on the most relevant documents quickly. CAL does not require training sets as earlier versions of predictive coding did. Instead, CAL allows us to immediately apply the results of reviewed documents to the unreviewed population to identify other documents that are likely to be relevant and prioritize them for review. Think of it as if you pointed to the relevant and key documents and instructed the platform to “find more documents like these.” While an oversimplification, this is essentially what CAL does for you.

CAL is an iterative process. After scoring the results of the first run, the algorithm can be run at regular intervals during the review to continue to improve accuracy and percolate to the top those documents most likely to be relevant or key. Since CAL relies upon textual content, understanding the types of documents in your collection will help you to strategically deploy CAL to ensure the best results.
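The iterative loop can be illustrated with a deliberately simple stand-in. Commercial CAL tools rely on trained classifiers whose details vary by vendor. The sketch below replaces the classifier with a crude word-overlap score, purely to show the review, re-rank, review cycle. The documents, the `truth` labels that play the role of the human reviewer, and all function names are hypothetical.

```python
from collections import Counter

def tokens(text):
    return set(text.lower().split())

def term_weights(reviewed):
    """+1 for each relevant document containing a term, -1 for each non-relevant one."""
    weights = Counter()
    for text, relevant in reviewed:
        for term in tokens(text):
            weights[term] += 1 if relevant else -1
    return weights

def rank_unreviewed(unreviewed, weights):
    """Order unreviewed document IDs from most to least likely relevant."""
    def score(doc_id):
        return sum(weights[t] for t in tokens(unreviewed[doc_id]))
    return sorted(unreviewed, key=lambda d: (-score(d), d))

def continuous_review(docs, seed_ids, reviewer, batch_size=2):
    """Review seeds, then repeatedly re-rank and review the top-scoring batch."""
    reviewed = {d: reviewer(d) for d in seed_ids}
    order = list(seed_ids)
    while len(reviewed) < len(docs):
        weights = term_weights([(docs[d], rel) for d, rel in reviewed.items()])
        unreviewed = {d: t for d, t in docs.items() if d not in reviewed}
        for d in rank_unreviewed(unreviewed, weights)[:batch_size]:
            reviewed[d] = reviewer(d)  # the human call feeds the next ranking
            order.append(d)
    return order, reviewed

# Hypothetical collection; `truth` stands in for the human reviewer.
docs = {
    "d1": "payment to consultant before tender",
    "d2": "lunch menu for friday",
    "d3": "consultant invoice payment approved",
    "d4": "friday parking notice",
    "d5": "tender award consultant fee",
    "d6": "office lunch friday reminder",
}
truth = {"d1": True, "d2": False, "d3": True, "d4": False, "d5": True, "d6": False}

order, reviewed = continuous_review(docs, ["d1", "d2"], truth.get)
print(order)
```

In this toy collection, the documents that resemble the relevant seed are reviewed before the ones that resemble the non-relevant seed. In a real review, the process would usually stop once the top-ranked batches no longer yield relevant documents, rather than running through the whole population.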

CAL has many benefits. It can be applied to review populations of varied sizes and to foreign language content, and it is more flexible than TAR 1.0. A review can start as soon as you have documents, and the system continues to learn from responsive calls as you learn more and add to the population. CAL also reduces the number of nonresponsive documents that require review and helps to find the “good stuff” early in the review process. A key benefit of CAL is that it reduces time and cost. In one case, we had collected over 600,000 documents. While search terms and date restrictions reduced the population for review to 94,000 documents, by using CAL we ultimately reviewed only 21,351 documents (or 22.7 percent of the search term hits in our date range). This shaved weeks off the review and saved our client tens of thousands of dollars in review costs. In another investigation, we had collected in excess of 20 million documents, and search terms hit on nearly 1 million. Using CAL, we reviewed only around 200,000 documents, again saving the client hundreds of thousands of dollars.

5. Concept Clustering

Concept clustering is another technology for grouping similar documents together. While concept clustering can be used before CAL to identify sets of documents that might be relevant and therefore a good place to start investigating, we most often use concept clustering once documents have been batched and sent out to the review team members. The review team uses concept clustering to organize the review sets into groups of conceptually similar topics. By evaluating the relevance of each cluster, reviewers can get to the most potentially responsive documents more quickly.

Some concept clustering solutions are text-based, that is, the reviewer is provided with a list of concepts present in the batch. Other platforms present the clusters visually. Both work well, though we have found visual clustering particularly helpful in allowing teams to quickly identify documents they want to review first.
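To give a feel for what text-based clustering produces, here is a toy sketch that groups documents by shared vocabulary. It is an assumption-laden simplification: commercial concept clustering relies on more sophisticated semantic models, whereas this example uses plain word overlap (Jaccard similarity) with a greedy single pass. The threshold, documents, and function names are hypothetical.

```python
def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single pass: join the first cluster that is similar enough, else start one."""
    clusters = []
    for doc_id in sorted(docs):
        terms = set(docs[doc_id].lower().split())
        for c in clusters:
            if jaccard(terms, c["terms"]) >= threshold:
                c["ids"].append(doc_id)
                c["terms"] |= terms  # the cluster vocabulary grows as documents join
                break
        else:
            clusters.append({"terms": set(terms), "ids": [doc_id]})
    return clusters

# Hypothetical review batch.
batch = {
    "d1": "consultant payment tender award",
    "d2": "friday lunch menu",
    "d3": "tender award consultant fee",
    "d4": "lunch menu friday reminder",
}

clusters = cluster(batch)
groups = [c["ids"] for c in clusters]
for c in clusters:
    print(c["ids"], sorted(c["terms"]))
```

Printing each cluster's vocabulary alongside its document IDs mirrors the text-based presentation, in which the reviewer sees a list of concepts for each group.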

6. Email Threading

Email threading is a way to organize all emails in an email string. This enables investigators to see each email in context and ensure all emails related to a conversation are considered. Identifying strings this way can reduce time and costs by allowing investigators to review the most comprehensive emails in the string (e.g., those that include all the previous emails) and to identify where strings may diverge. For example, if, in the course of an email conversation, one of the participants suddenly starts copying a third person or starts forwarding information to a new person, this may prove relevant. Additionally, emails missing from a string may prove to be an avenue of investigation to determine why parts of the conversation are absent from the collection.
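The ideas above (grouping a string, finding its most comprehensive emails, spotting newly added participants, and noticing missing emails) can be sketched as follows. This is a simplified illustration, not how any particular platform threads email. The record fields `id`, `in_reply_to`, and `people` are hypothetical stand-ins loosely modeled on the Message-ID and In-Reply-To headers, and the sketch treats the last email in each branch as a proxy for the most comprehensive one.

```python
def build_threads(messages):
    """Group message IDs by thread root, following in_reply_to links."""
    by_id = {m["id"]: m for m in messages}

    def root_of(msg_id):
        seen = set()
        while True:
            parent = by_id[msg_id].get("in_reply_to")
            if parent is None or parent not in by_id or parent in seen:
                return msg_id
            seen.add(msg_id)
            msg_id = parent

    threads = {}
    for m in messages:
        threads.setdefault(root_of(m["id"]), []).append(m["id"])
    return threads

def leaf_messages(messages):
    """Messages nobody replied to: usually the most inclusive in their branch."""
    replied_to = {m.get("in_reply_to") for m in messages}
    return [m["id"] for m in messages if m["id"] not in replied_to]

def missing_parents(messages):
    """IDs that are replied to but are absent from the collection."""
    ids = {m["id"] for m in messages}
    return sorted({m["in_reply_to"] for m in messages
                   if m.get("in_reply_to") and m["in_reply_to"] not in ids})

def new_participants(messages):
    """For each reply, people who were not on the message it answers."""
    by_id = {m["id"]: m for m in messages}
    added = {}
    for m in messages:
        parent = by_id.get(m.get("in_reply_to"))
        if parent:
            extra = set(m["people"]) - set(parent["people"])
            if extra:
                added[m["id"]] = sorted(extra)
    return added

# Hypothetical collection: message "d" was never collected.
messages = [
    {"id": "a", "in_reply_to": None, "people": ["ann", "bob"]},
    {"id": "b", "in_reply_to": "a", "people": ["ann", "bob"]},
    {"id": "c", "in_reply_to": "b", "people": ["ann", "bob", "carl"]},
    {"id": "e", "in_reply_to": "d", "people": ["ann", "dana"]},
]

threads = build_threads(messages)
print(threads, leaf_messages(messages), missing_parents(messages), new_participants(messages))
```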

7. Advanced and Similarity Searches

A technology platform’s advanced search functionality is vital to identifying documents that contain keywords used in a particular context. We leverage Boolean, proximity, fuzzy, wildcard and stemming search functionality. We have also utilized pattern matching to identify, for example, phone numbers or patent numbers that are search terms, or personally identifiable information (PII) that requires redaction. Understanding how to leverage the search functionality and properly apply search syntax is critical. Assessing responsiveness rates, as mentioned earlier, can also be helpful in ensuring search terms are returning the right documents, that is, those that further your investigation.
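Pattern matching of this kind is typically expressed as a regular expression. The example below is a sketch using Python's `re` module and an assumed North American phone number format. Review platforms have their own search syntax, and the `PHONE` pattern would need adjusting for other formats or for identifiers such as patent numbers.

```python
import re

# Assumed format: optional +1 or 1 prefix, then 3-3-4 digits separated by
# spaces, dots, or hyphens, with optional parentheses around the area code.
# The lookarounds prevent matches inside longer runs of digits.
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)

text = ("Call me at (612) 555-0142 or 612.555.0199; "
        "invoice 1234567890123 is unrelated.")

found = PHONE.findall(text)                # locate candidate phone numbers
redacted = PHONE.sub("[REDACTED]", text)   # or mask them for production

print(found)
print(redacted)
```

The long invoice number is not matched because it is a longer run of digits, which illustrates why pattern boundaries matter as much as the pattern itself.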

We often use search terms in conjunction with other technology tools. In one investigation, we had a single letter that was a search/code term, and being able to search for that letter while leveraging CAL was vital to reducing the number of documents returned by the very overbroad, single-letter term.

Similarity searches can also be used to identify near duplicate documents. A near duplicate may simply indicate an earlier draft of a document, but we have also seen cases where it indicates a document that has been slightly altered to help perpetrate a fraud. Exact duplicate searches (by searching an All Custodian or MD5Hash field) can also be helpful, providing evidence, for example, that two people received the same information at the same meeting.
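The difference between exact and near duplicates can be shown with Python's standard library. This is a minimal sketch under assumptions: it hashes extracted text rather than the original file bytes, uses `difflib.SequenceMatcher` on word sequences as a stand-in for a platform's similarity engine, and uses an arbitrary 0.9 threshold. The documents and function names are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher

def md5_of(text):
    """MD5 digest of the text, used here only as a content fingerprint."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def exact_duplicates(docs):
    """Groups of document IDs whose content hashes are identical."""
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(md5_of(text), []).append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]

def near_duplicates(docs, threshold=0.9):
    """Pairs of non-identical documents whose word sequences are highly similar."""
    ids = sorted(docs)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if docs[a] == docs[b]:
                continue  # exact duplicates are reported separately
            ratio = SequenceMatcher(None, docs[a].split(), docs[b].split()).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical documents: "c" differs from "a" by a single altered amount.
docs = {
    "a": "Invoice 1001 total due 5,000 USD payable to Acme Consulting by March 1",
    "b": "Invoice 1001 total due 5,000 USD payable to Acme Consulting by March 1",
    "c": "Invoice 1001 total due 50,000 USD payable to Acme Consulting by March 1",
    "d": "Team lunch is moved to Friday",
}

exact = exact_duplicates(docs)
near = near_duplicates(docs)
print(exact)
print(near)
```

Comparing every pair of documents is only practical for small sets. Large collections need an indexing approach, which is what review platforms provide.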

8. Event Chronologies

Creating event chronologies enables teams to go deeper to associate information with particular events. In complex matters, this technology enables teams to build a chronology of events and link directly to the evidence in the database, whether the evidence is particular documents, witness interviews that have been uploaded to the database, or other items. The ability to link data to chronologies facilitates building a complete narrative that can be shared for better collaboration. Similar to timeline gap analysis, this tool in the technology toolkit can also help teams identify where the gaps in the chronology are, so investigators can determine what they need to fill in.

9. Review Process Agility

The ability to adapt the review process to new information is critical. Frequently, investigators begin an investigation targeting one type of behavior or event only to uncover something very different. From a strategic perspective, the technology tools must be flexible enough to allow investigators to change direction. For example, they may start with a set of search terms that they think will help prove or disprove a case, but they then find that the search terms are not working well, while a few key documents lead them down a completely different path. Investigators must have the technical flexibility, that is, sufficient tools in their toolkit, to go wherever they need to.

Review process agility is also about having regular communications between the review team and the legal team. What are the reviewers seeing, what are they surprised by, what is bothering them? The more back-and-forth there is, the more you can be assured of the thoroughness of the investigative process.

Additionally, review process agility means being able to accommodate different review processes for different types of data formats, for example, being able to review mobile device or chat data that does not present itself as easily for review as do email or other electronic documents. Some investigations may require a process to review partial documents that have been recovered from slack or unallocated space on a computer. Having the technical tools and a review process in place for these situations is important to the success of the investigation.

10. End-to-End Process Flexibility and Adaptability

An extension of review process agility is end-to-end process flexibility and adaptability. This involves having the technology to deal with as many investigative situations as possible.

For example, foreign language content is often part of a review. Processing technology can flag foreign language content in the collection, alerting you to the types of languages and volume of documents that require foreign language review. Reviewing foreign language content can be tricky, and teams must weigh the cost of machine-based translation versus hiring translators or relying on language-specific review teams. We have found that current technology is very good at translating European languages with sufficient clarity to determine potential relevance. This way, only the potentially relevant documents need to be reviewed by a language-specific review team. This can save a tremendous amount of money. Unfortunately, not all languages — especially Chinese and Japanese — lend themselves to machine translation, and we typically use language-specific review teams for these reviews.

In one investigation, we used machine translation on French language documents to help us assess relevancy. Anything deemed relevant then went to the French review team for confirmation. The relevant French documents then went through the CAL assessment to help us locate similar content.

File path analysis provides the ability to determine where individuals were storing documents. This type of forensic analysis can, for example, help teams determine if files were stored in nonobvious locations, potentially to prevent detection should someone check the computer. In one case, we were able to use file path analysis to identify instant messaging communications that synched with the email platform, so we were able to prioritize those less formal communications for review. File type analysis can also be useful in helping you to identify review needs, or perhaps in identifying documents that can be deprioritized in the review process because they will likely be irrelevant based on their storage location.
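A basic version of file path and file type analysis can be run over an exported file listing. The sketch below is illustrative only: the paths, the list of expected storage locations, and the function names are hypothetical, and forward-slash paths are assumed for simplicity.

```python
from collections import Counter
from pathlib import PurePosixPath

def folder_counts(paths, depth=3):
    """Count files by the first `depth` folder levels of their path."""
    counts = Counter()
    for p in paths:
        folders = PurePosixPath(p).parts[:-1]  # drop the file name
        counts["/".join(folders[:depth])] += 1
    return counts

def extension_counts(paths):
    """Count files by extension, a rough proxy for file type."""
    return Counter(PurePosixPath(p).suffix.lower() or "(none)" for p in paths)

def in_unusual_locations(paths, expected_roots):
    """Files stored outside the folders where documents are normally kept."""
    return [p for p in paths
            if not any(p.startswith(root) for root in expected_roots)]

# Hypothetical file listing for one custodian.
paths = [
    "Users/jdoe/Documents/q3_forecast.xlsx",
    "Users/jdoe/Documents/board_deck.pptx",
    "Users/jdoe/AppData/Local/Temp/old/pricing_copy.xlsx",
    "Users/jdoe/Music/call_2017-03-02.wav",
]

by_folder = folder_counts(paths)
by_type = extension_counts(paths)
unusual = in_unusual_locations(paths, ["Users/jdoe/Documents/"])
print(by_folder)
print(by_type)
print(unusual)
```

The extension counts also flag the audio file, which ties in with planning for content that needs a different review workflow.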

In some investigations, we have needed to review audio content. Again, having the technology in place to assist with audio review can be invaluable. There are tools that enable searching without transcribing the audio file or having to fully listen to hours and hours of recorded calls. Conducting a file type analysis can help you determine if your collection has substantial audio content, so you can strategize how best to investigate this data source.

Finally, as mentioned earlier, investigators need to document the process undertaken so it can be defended to investigating bodies, the corporate board, or in court should litigation arise. What tools were used to identify individuals and relevant information? How did you ensure relevant sources were not neglected? How was data collected and processed? Was it deduped globally or within custodians? Did the investigation rely on CAL? Were face-to-face interviews conducted? While documenting all this is important in e-discovery, it may be even more important in investigations because the evidence is often turned over to external agencies that must be able to understand and trust the process.


When it comes to being a cyber sleuth today, even Sherlock Holmes would need modern technology tools to consume mountains of data, uncover relationships, identify connections over time, prioritize evidence, and document his processes. If an organization’s investigators have a technology toolkit that enables them to follow the 10 best practices above, they will have everything they need (except perhaps Holmes’ hat and pipe) to allow their keen intellects to roam free and quickly solve even the most complex conundrum.


Caroline Sweeney is the global director for e-discovery and client technology at Dorsey & Whitney LLP and a faculty member for the Compliance, Governance and Oversight Council (CGOC). She is also a member of The Sedona Conference Working Group on Electronic Document Retention and Production and sits on the information governance steering committee for the International Legal Technology Association.
