187. Repeated DNA Sequences

Problem

All DNA is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T', for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

 

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

 

Constraints:

  • 0 <= s.length <= 105
  • s[i] is 'A', 'C', 'G', or 'T'.

Discussion

This problem requires finding substring with fixed-length-of-10 and occur-more-than-once in a given input string.

Instead of scanning by brutal force which takes O(n^2), we can make use the idea of sliding window algorithm. Then it's easy to find all substrings-of-size-10 and count their frequency.

Solution

Our solution consists of two steps, counter-construction and frequency-counting.

First we make use of the idea of sliding-window-algorithm to scan all substring of size-10 in one linear iteration, with two pointers moving together with fixed distance of 10. The substrings are cached in a counter (or hashmap) for frequency counting.

Then we simply return all substrings of frequency more than once.

from collections import Counter

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        cache = Counter([s[i: i+10] for i in range(len(s) - 9)])
        return [key for key in cache.keys() if cache[key] > 1]

Complexity Analysis

  • Time Complexity: O(n), given a linear iteration with two pointers.
  • Space Complexity: O(n), as a substring counter is saved in memory.