문제해결(PS)/ROSALIND

k-Mer Composition

곰탱이장 2024. 9. 7. 18:41

 

https://rosalind.info/problems/kmer/

 

ROSALIND | k-Mer Composition

It appears that your browser has JavaScript disabled. Rosalind requires your browser to be JavaScript enabled. k-Mer Composition solved by 3385 2012년 7월 13일 12:00:00 오전 by Gabriel Valiente Topics: String Algorithms Generalizing GC-Content Figure

rosalind.info

Problem

For a fixed positive integer kk, order all possible k-mers taken from an underlying alphabet lexicographically.

Then the k-mer composition of a string ss can be represented by an array AA for which A[m]A[m] denotes the number of times that the mmth k-mer (with respect to the lexicographic order) appears in ss.

Given: A DNA string ss in FASTA format (having length at most 100 kbp).

Return: The 4-mer composition of ss.

Sample Dataset

>Rosalind_6431
CTTCGAAAGTTTGGGCCGAGTCTTACAGTCGGTCTTGAAGCAAAGTAACGAACTCCACGG
CCCTGACTACCGAACCAGTTGTGAGTACTCAACTGGGTGAGAGTGCAGTCCCTATTGAGT
TTCCGAGACTCACCGGGATTTTCGATCCAGCCTCAGTCCAGTCTTGTGGCCAACTCACCA
AATGACGTTGGAATATCCCTGTCTAGCTCACGCAGTACTTAGTAAGAGGTCGCTGCAGCG
GGGCAAGGAGATCGGAAAATGTGCTCTATATGCGACTAAAGCTCCTAACTTACACGTAGA
CTTGCCCGTGTTAAAAACTCGGCTCACATGCTGTCTGCGGCTGGCTGTATACAGTATCTA
CCTAATACCCTTCAGTTCGCCGCACAAAAGCTGGGAGTTACCGCGGAAATCACAG

Sample Output

4 1 4 3 0 1 1 5 1 3 1 2 2 1 2 0 1 1 3 1 2 1 3 1 1 1 1 2 2 5 1 3 0 2 2 1 1 1 1 3 1 0 0 1 5 5 1 5 0 2 0 2 1 2 1 1 1 2 0 1 0 0 1 1 3 2 1 0 3 2 3 0 0 2 0 8 0 0 1 0 2 1 3 0 0 0 1 4 3 2 1 1 3 1 2 1 3 1 2 1 2 1 1 1 2 3 2 1 1 0 1 1 3 2 1 2 6 2 1 1 1 2 3 3 3 2 3 0 3 2 1 1 0 0 1 4 3 0 1 5 0 2 0 1 2 1 3 0 1 2 2 1 1 0 3 0 0 4 5 0 3 0 2 1 1 3 0 3 2 2 1 1 0 2 1 0 2 2 1 2 0 2 2 5 2 2 1 1 2 1 2 2 2 2 1 1 3 4 0 2 1 1 0 1 2 2 1 1 1 5 2 0 3 2 1 1 2 2 3 0 3 0 1 3 1 2 3 0 2 1 2 2 1 2 3 0 1 2 3 1 1 3 1 0 1 1 3 0 2 1 2 2 0 2 1 1

 

 이 문제는 4-mer compostion을 사전순으로 구하고 각 composition이 서열에 몇개나 들어가는지 알아내는 문제이다. 4-mer compostion는필자는 DFS를 응용하여 구했다. 그리고 이렇게 구한 것을 딕셔너리의 키로 하고 밸류를 0으로 하여, 서열을 돌며 4개의 그 서브 서열을 키로 넣어 1씩 더하게 하여 개수를 구했다.

from Bio import SeqIO

def making_mer(s):
    if len(s)==4:
        mer[s]=0
        return
    for i in 'ACGT':
        n=s+i
        making_mer(n)
        n=s

if __name__=='__main__':
    with open(r'파일경로','r') as fa:
        seq=str(next(SeqIO.parse(fa,'fasta')).seq)

mer={}
making_mer('')
ans=[]
for i in range(len(seq)-3):
    s=seq[i:i+4]
    mer[s]=mer[s]+1

for i in mer.values():
    print(i,end=' ')