Accessing GenBank with the Entrez module

곰탱이장 2024. 11. 24. 09:54

Biopython's Entrez module lets you connect to NCBI databases directly from code.

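In short, we first have to give an email address at which NCBI can contact us if something goes wrong. We then pass what we are looking for to a search function such as Entrez.esearch(), which returns a handle, and reading that handle gives us the results. This is hard to follow in words alone, so let's look at some code.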

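from Bio import Entrez

if __name__ == '__main__':
    # NCBI asks for a contact email with every request
    Entrez.email = 'your_name@your_address.com'

    # Input file: genus name, start date and end date, one per line
    with open(r"file_path", 'r') as f:
        fs = [line.rstrip() for line in f]

    genus_name, date1, date2 = fs[0], fs[1], fs[2]

    # Search the nucleotide database for records of this genus
    # published between the two dates
    handle = Entrez.esearch(db='nucleotide', term=f"{genus_name}[Organism]",
                            mindate=date1, maxdate=date2, datetype='pdat')
    record = Entrez.read(handle)
    handle.close()

    # Total number of matching records
    print(record["Count"])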

The code above is a solution to the problem at https://rosalind.info/problems/gbk/. It starts by importing the Entrez module from Bio.

from Bio import Entrez

Next, Entrez.email is set to an email address.

Then handle is assigned the return value of Entrez.esearch().

handle = Entrez.esearch(db='nucleotide', term=f"{genus_name}[Organism]",mindate=date1, maxdate=date2,datetype='pdat')

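Here db= is the database to search, term= is the query string exactly as you would type it into the search engine, mindate and maxdate are the earliest and latest dates, and datetype= says which kind of date the range applies to. These are called parameters, and the commonly used parameters of Entrez.esearch() are listed below.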

| Parameter | Description | Example |
|-----------|-------------|---------|
| `db` | Database to search | `db="nucleotide"` |
| `term` | Search term | `term="BRCA1[Gene]"` |
| `retmax` | Maximum number of results to return | `retmax=10` |
| `retstart` | Offset for the results | `retstart=0` |
| `retmode` | Return data format (`xml` or `json`) | `retmode="xml"` |
| `rettype` | Type of data to return (`uilist`, `count`, etc.) | `rettype="uilist"` |
| `mindate` | Start date for filtering | `mindate="2023/01/01"` |
| `maxdate` | End date for filtering | `maxdate="2023/12/31"` |
| `datetype` | Filter by specific date type (`pdat`, `mdat`) | `datetype="pdat"` |
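The retstart and retmax parameters work together for paging: a single esearch call returns at most retmax IDs, so a large result set has to be read in slices. Below is a minimal sketch of that idea. The helper name esearch_pages, the page size of 100 and the example genus are all assumptions made for illustration, not part of Biopython. The helper only builds keyword arguments; each dictionary could then be passed on as Entrez.esearch(**params).

```python
def esearch_pages(term, total, page_size=100, db="nucleotide"):
    """Yield keyword-argument dicts for Entrez.esearch, one per page.

    `total` is the number of matches (the "Count" value of a first search).
    The helper name and its defaults are illustrative, not part of Biopython.
    """
    for start in range(0, total, page_size):
        yield {
            "db": db,
            "term": term,
            "retstart": start,    # offset of the first ID on this page
            "retmax": page_size,  # upper bound on IDs returned per call
        }


# Example genus and total are placeholders chosen for the illustration.
pages = list(esearch_pages("Anthoxanthum[Organism]", total=250))
print([p["retstart"] for p in pages])
```

With 250 matches and a page size of 100, this yields three requests, the last of which returns only the remaining 50 IDs.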

Passing this handle to Entrez.read() parses the XML that the search returned into a dictionary-like object holding the results. Its main keys are as follows.

Explanation:

Count: Total number of results matching the query.
RetMax: Maximum number of results returned (20 by default when retmax is not set).
RetStart: Starting point for the returned results (default is 0).
IdList: List of unique identifiers (IDs) for the search results.
QueryTranslation: How the query was interpreted by Entrez.
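To make the structure concrete, here is a minimal sketch that reads those keys from a result. The sample dictionary is written by hand with made-up values to mimic the shape of what Entrez.read() returns, and summarize_search is a hypothetical helper name. Note that numeric fields such as Count come back as strings, so they are converted with int() before any arithmetic.

```python
def summarize_search(record):
    """Return (total matches, IDs in this response, whether more pages remain)."""
    count = int(record["Count"])      # numeric fields arrive as strings
    ids = list(record["IdList"])
    start = int(record["RetStart"])
    has_more = start + len(ids) < count
    return count, ids, has_more


# Hand-written stand-in for an Entrez.read() result; the values are made up.
sample = {
    "Count": "7",
    "RetMax": "3",
    "RetStart": "0",
    "IdList": ["1001", "1002", "1003"],
    "QueryTranslation": '"Anthoxanthum"[Organism]',
}

count, ids, has_more = summarize_search(sample)
print(count, ids, has_more)
```

When has_more is true, the remaining IDs can be requested by repeating the search with a larger retstart.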

So one approach is to read an esearch handle and pull out the IDs that match a search. The other, for when you already know the ID, is to use an efetch handle to retrieve the full record for that ID.

Entrez.efetch() requires both db and id.

Entrez.efetch(db, id)

The commonly used parameters of Entrez.efetch() are listed below.

| Parameter | Description | Example |
|--------------------|---------------------------------------------------|-----------------------------------|
| db | NCBI database to query (e.g., nucleotide, protein, pubmed). | db="nucleotide" |
| id | Unique identifier(s) (UIDs) of the records to fetch. | id="2236918015" or id="2236918015,2236918014" |
| rettype | Format of the returned data. Common options depend on the database. | rettype="fasta", rettype="gb", rettype="medline" |
| retmode | Mode of the returned data. Common values: text, xml, asn.1. | retmode="text", retmode="xml" |
| seq_start | For sequences, the starting position of the region to retrieve. | seq_start=100 |
| seq_stop | For sequences, the ending position of the region to retrieve. | seq_stop=200 |
| strand | For DNA sequences, specify strand to retrieve (1: forward, 2: reverse). | strand=1 |
| complexity | How much of the record to return (0: entire blob, 1: bioseq only, and so on). | complexity=0 |
| retstart | Record index to start fetching (for pagination). | retstart=5 |
| retmax | Maximum number of records to return. | retmax=10 |
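As a sketch of how these parameters combine, the helper below builds the arguments for fetching part of a single sequence as FASTA text. The name region_request is hypothetical, and the ID and coordinates are simply the example values from the table. The resulting dictionary could be passed on as Entrez.efetch(**params).

```python
def region_request(uid, start, stop, reverse=False, db="nucleotide"):
    """Build Entrez.efetch keyword arguments for a sub-region of one record.

    Coordinates are 1-based, as the E-utilities expect. The helper itself
    is illustrative and not part of Biopython.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    return {
        "db": db,
        "id": uid,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,             # first base to retrieve
        "seq_stop": stop,               # last base to retrieve
        "strand": 2 if reverse else 1,  # 1: forward, 2: reverse
    }


params = region_request("2236918015", 100, 200)
print(params)
```

Keeping the argument building separate from the network call makes it easy to check a request before sending it to NCBI.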

With the Entrez module, it looks like we can get at GenBank data easily and receive it in a form that is simple to process.

A problem that makes use of Entrez.efetch() is https://rosalind.info/problems/frmt/.

In this problem we are given several IDs and have to write the record with the shortest sequence to a file in FASTA format.

from Bio import Entrez
from Bio import SeqIO

if __name__ == '__main__':
    # NCBI asks for a contact email with every request
    Entrez.email = 'your_name@your_address.com'

    # The input file holds whitespace-separated GenBank IDs on one line
    with open(r'input_file_path', 'r') as f:
        ids = f.readline().split()

    # Fetch all records in one request, as FASTA text, and parse them
    handle = Entrez.efetch(db='nucleotide', id=ids, rettype='fasta')
    records = list(SeqIO.parse(handle, 'fasta'))
    handle.close()

    # Pick the record with the shortest sequence (first one wins on ties)
    shortest = min(records, key=lambda rec: len(rec.seq))

    with open(r'output_file_path', 'w') as wf:
        SeqIO.write(shortest, wf, 'fasta')

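Here the response is FASTA text containing several records rather than XML, so it is parsed with SeqIO.parse() instead of Entrez.read(). After that it is just a simple length comparison, and SeqIO writes the result to a file in FASTA format.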
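One caveat: SeqIO.write() re-wraps sequence lines (60 characters per line by default), so the file may not match the text NCBI returned character for character. If the header and line wrapping have to stay as received, one option is to work on the raw text instead. Below is a minimal sketch using only the standard library. The helper shortest_fasta_record is hypothetical and the sample text is made up; with real data the text would come from handle.read().

```python
def shortest_fasta_record(fasta_text):
    """Return the raw lines of the record with the shortest sequence."""
    records = []  # each record is a list of its raw lines, header first
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            records.append([line])        # a header starts a new record
        elif records and line.strip():
            records[-1].append(line)      # sequence line of the current record

    def seq_length(lines):
        # Skip the header and count the residues on the remaining lines
        return sum(len(line.strip()) for line in lines[1:])

    # min() keeps the first record when several share the shortest length
    shortest = min(records, key=seq_length)
    return "\n".join(shortest) + "\n"


# Made-up sample with two records separated by a blank line
sample = ">rec1 first\nACGTACGT\nACGT\n\n>rec2 second\nACGTA\n"
print(shortest_fasta_record(sample), end="")
```

The returned string can be written to the output file with a plain wf.write(), which leaves the original line breaks untouched.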

In this way, the Entrez module lets us pull data out of the GenBank database from code and work with it.