[WebCrawling] Let's Analyze a Website Ourselves - Part 3


ThreeLight 2022. 12. 5. 00:00


Type: Data collection / analysis

Topic: Web Crawling

IDE: IntelliJ IDEA

Language: Python

Packages: bs4 - BeautifulSoup, requests - get

GitHub Link: https://github.com/TMInstaller/WebCrawling_Myblog


Read the previous part first

[WebCrawling] Let's Analyze My Blog's Information Ourselves - Part 2


time-map-installer.tistory.com

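Design - What feature should we add?

The goal for this part is as follows.

Goal: pick out only the information we need, collect it in one place, and print it

In this part we will improve the code written in the previous post.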

Writing the code

1. Store the crawled data in a dict

2. Print everything at once

1. Store the crawled data in a dict

for post in key_posts:
    span = post.find('a')
    title = span.find('span', class_='title')
    meta = span.find('span', class_='meta')
    date = meta.find('span', class_='date')

    excerpt = span.find('span', class_='excerpt')
    key_data = {
        'title': title.string,
        'date': date.string,
        'prev': excerpt.string
    }
    results.append(key_data)

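A few things have changed since the previous post.

First, I decided to drop the thumbnail image from the collected data.

Second, the date sits inside the meta element, so it cannot be read from the link directly.

I therefore added a line that looks up date inside meta.

To make later work easier, I also changed the order of the fields.

They are now stored in the order title - date - prev.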
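The nested lookup described above can be tried without hitting the network. The following is a minimal sketch, assuming a hand-written HTML fragment (`SAMPLE_HTML`) that imitates one search result; the real page markup may differ. The helpers `text_of` and `parse_post` are hypothetical names introduced only for this illustration. It uses `get_text()` instead of `.string`, because `.string` returns None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one search result (assumption, not the real page).
SAMPLE_HTML = """
<div class="post-item">
  <a href="/1">
    <span class="title">Sample title</span>
    <span class="excerpt">Sample preview text</span>
    <span class="meta"><span class="date">2022. 12. 5.</span></span>
  </a>
</div>
"""


def text_of(tag):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_post(post):
    """Build a title - date - prev dict from one post-item element."""
    link = post.find('a')  # assumes every post-item contains a link
    # The date lives inside the meta span, so it is looked up in two steps.
    meta = link.find('span', class_='meta')
    date = meta.find('span', class_='date') if meta is not None else None
    return {
        'title': text_of(link.find('span', class_='title')),
        'date': text_of(date),
        'prev': text_of(link.find('span', class_='excerpt')),
    }


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_results = [parse_post(p) for p in soup.find_all('div', class_='post-item')]
print(sample_results)
```

Returning None for a missing element keeps one odd post from stopping the whole crawl, at the cost of having to handle None values when printing.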

2. Print everything at once

# for key_section in keys:

for result in results:
    print(result)
    print('////////////////')

A for loop prints every stored value.
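Because every entry shares the same keys, the output can also be shaped per entry rather than printing the raw dict. Here is a small sketch, assuming made-up sample data in the same shape as `results`; `format_result` is a hypothetical helper, not part of the original code.

```python
# Hypothetical sample data shaped like the `results` list built by the crawler.
sample_results = [
    {'title': 'First post', 'date': '2022. 12. 1.', 'prev': 'Preview A'},
    {'title': 'Second post', 'date': '2022. 12. 5.', 'prev': 'Preview B'},
]


def format_result(result):
    """Join the values of one result in title - date - prev order."""
    return ' | '.join(str(result[key]) for key in ('title', 'date', 'prev'))


lines = [format_result(r) for r in sample_results]
for line in lines:
    print(line)
    print('////////////////')
```

Reading the keys in a fixed tuple keeps the printed order stable even if the dict is later built in a different order.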

0. Full code / execution result

# main.py
from requests import get
from bs4 import BeautifulSoup

url = "https://time-map-installer.tistory.com/search/"
keyword = "Python"
response = get(f"{url}{keyword}")
if response.status_code != 200:
    print("Could not load the page", response.status_code)
else:
    # Create an empty list to hold the dict results
    results = []
    soup = BeautifulSoup(response.text, 'html.parser')
    keys = soup.find_all('div', class_='inner')
    for key_section in keys:
        key_posts = key_section.find_all('div', class_='post-item')
        for post in key_posts:
            span = post.find('a')
            # Leave the image out of the collected data
            title = span.find('span', class_='title')
            # The date class sits inside the meta class, so search one level deeper
            meta = span.find('span', class_='meta')
            date = meta.find('span', class_='date')

            excerpt = span.find('span', class_='excerpt')
            # Build key_data, a dict holding the key/value pairs
            key_data = {
                'title': title.string,
                'date': date.string,
                'prev': excerpt.string
            }
            results.append(key_data)
    # Print the results outside the crawling loop
    for result in results:
        print(result)
        print('////////////////')

This is the full commented code and its execution result.

The output is easier to read than before.

To be continued in the next part.


License: Attribution, NonCommercial, NoDerivatives