187. Repeated DNA Sequences
Problem
All DNA is composed of a series of nucleotides abbreviated as 'A'
, 'C'
, 'G'
, and 'T'
, for example: "ACGAATTCCG"
. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT" Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA" Output: ["AAAAAAAAAA"]
Constraints:
0 <= s.length <= 105
s[i]
is'A'
,'C'
,'G'
, or'T'
.
Discussion
This problem requires finding substring with fixed-length-of-10 and occur-more-than-once in a given input string.
Instead of scanning by brutal force which takes O(n^2)
, we can make use the
idea of sliding window algorithm. Then it's easy to find all
substrings-of-size-10 and count their frequency.
Solution
Our solution consists of two steps, counter-construction and frequency-counting.
First we make use of the idea of sliding-window-algorithm to scan all
substring of size-10 in one linear iteration, with two pointers moving together
with fixed distance of 10. The substrings are cached in a counter (or hashmap)
for frequency counting.
Then we simply return all substrings of frequency more than once.
from collections import Counter
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
cache = Counter([s[i: i+10] for i in range(len(s) - 9)])
return [key for key in cache.keys() if cache[key] > 1]
Complexity Analysis
- Time Complexity:
O(n)
, given a linear iteration with two pointers. - Space Complexity:
O(n)
, as a substring counter is saved in memory.