December 2018 ~ 개발 공부 (Dev Study)

[Python] Naver Band Crawling + Character Counting

Laptop

OS	Windows 10 Home 64bit
Dev environment	Python 3.4.4, PyCharm 2018.1.3 (Community Edition)

[Updated 2019.07.21: revised to cover the added member-distinction feature]

Naver_Band_Comment_Count_Crawler.exe

This is a simple program I put together at a friend's request.

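What I'm posting here is a single-purpose program that crawls only the 작성글 보기 (view written posts) page,
but crawling a different page should follow the same procedure, since only the tags in the selectors need to change.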

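Using Selenium and a few other libraries, the program only needs you to log in; it then collects the comments on the comment page, counts each comment's characters both with and without whitespace, builds a list, and writes it out to an Excel file.
In addition, when a single comment addresses several people by tagging (mentioning) their member names, it is treated as several separate comments in the results.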
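As a quick illustration of the two counts described above, here is a minimal sketch. The helper name `count_chars` is my own and is not part of the program; it only mirrors the idea of `len()` for the count with whitespace and a regex match on non-whitespace characters for the count without.

```python
import re


def count_chars(text):
    """Return (count including whitespace, count excluding whitespace)."""
    with_spaces = len(text)
    # \S matches any single non-whitespace character, so the number of
    # matches is the length with spaces, tabs, and newlines left out.
    without_spaces = len(re.findall(r"\S", text))
    return with_spaces, without_spaces


if __name__ == "__main__":
    print(count_chars("hello band  world"))
```

Korean text is counted per character in the same way, since Python 3 strings are Unicode.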

I wanted to automate the login as well, but that seems to be impossible because of Naver's own security checks.

Two things need to be done before running the program.

1. Install chromedriver

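Selenium seems to work only with certain chromedriver versions (and chromedriver itself generally has to match the installed Chrome version), so here is the link to the version I used.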
https://chromedriver.storage.googleapis.com/index.html?path=72.0.3626.69/

Go there, download the Windows build, and unzip it; you will find a file named chromedriver.exe.
(See the image below.)

Place chromedriver.exe in the same folder as the program attached above.
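If you want to check the placement before launching anything, something like the following sketch could help. `find_chromedriver` is a hypothetical helper that is not in the program, and it assumes "same folder" means the folder of the running script or executable.

```python
import os
import sys


def find_chromedriver(folder=None, name="chromedriver.exe"):
    """Return the full path of chromedriver inside folder, or None if absent."""
    if folder is None:
        # Assumption: look next to the script / executable being run.
        folder = os.path.dirname(os.path.abspath(sys.argv[0]))
    candidate = os.path.join(folder, name)
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    path = find_chromedriver()
    print(path or "chromedriver.exe was not found next to the program")
```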

2. Check the URL

You need the link address of the page you want to check (here, the 작성글 보기 page).
To get there, follow the menu shown in the screenshot below.

The value you need is the URL portion of the screenshot just above.

When you run the program there is a field for entering the URL; enter the value there,
then press the START button, and a new Chrome window opens showing the login screen, as in the image below.

Note that logging in this way takes noticeably longer than in a normal Chrome session.
My guess is that the site detects that it is being accessed programmatically.

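In any case, once the page has fully loaded, click the OK button; the program then scrolls down on its own until everything has been loaded and generates the Excel file.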
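The scrolling step boils down to "keep scrolling until the page height stops growing". Below is a rough sketch of that idea, written against two plain callables so it can be tried without a browser. The function name, the `max_rounds` safety cap, and the fake height sequence are my own assumptions for illustration.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=0.5, max_rounds=1000):
    """Scroll until the reported height stops changing; return the scroll count."""
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds  # safety cap reached (an assumption, not in the program)


if __name__ == "__main__":
    # Simulated page: the height grows twice, then stays the same.
    heights = iter([100, 200, 300, 300])
    print(scroll_until_stable(lambda: next(heights), lambda: None, pause=0))
```

With Selenium, the two callables would wrap `driver.execute_script("return document.body.scrollHeight")` and `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`. A fixed pause is simple, but if the network is slow the loop may stop before the next batch of comments arrives.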

Finally, a save dialog appears. The default file name is Naver_Band_Comment_Count_Result.xlsx.

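The test comments have the following structure. Comments are matched to members purely by order of appearance; that is, [member1 comment1] and [comment1 member1] are treated the same. (See the results below.)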

In addition, members written back to back with nothing between them are all treated as mentions belonging to a single comment.
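To make these two rules concrete, here is a small regex-based sketch. It is an alternative formulation, not the string-index approach used in the program, and it assumes mentions appear in the comment's HTML as plain `<strong>name</strong>` tags without attributes. The name `split_mentions` is hypothetical.

```python
import re

# Assumption: a mention is rendered as a bare <strong>name</strong> tag.
MENTION = re.compile(r"<strong>(.*?)</strong>")


def split_mentions(html):
    """Return a list of (members, text) pairs matched by order of appearance."""
    members = []   # one entry per group of adjacent mentions
    comments = []  # text pieces in order of appearance
    pos = 0
    prev_was_mention = False
    for match in MENTION.finditer(html):
        between = html[pos:match.start()]
        if prev_was_mention and between == "":
            # Mentions written back to back share one comment.
            members[-1] += ", " + match.group(1)
        else:
            if between.strip():
                comments.append(between.strip())
            members.append(match.group(1))
        pos = match.end()
        prev_was_mention = True
    tail = html[pos:].strip()
    if tail:
        comments.append(tail)
    # Pair by position; a comment with no matching member gets "".
    return [(members[i] if i < len(members) else "", text)
            for i, text in enumerate(comments)]


if __name__ == "__main__":
    sample = "<strong>A</strong><strong>B</strong> hi <strong>C</strong> yo"
    print(split_mentions(sample))  # [('A, B', 'hi'), ('C', 'yo')]
```

Because pairing is by position only, `"<strong>A</strong>hello"` and `"hello<strong>A</strong>"` both give `[('A', 'hello')]`.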

The saved file looks like this.

The full source is below.

from selenium import webdriver
from tkinter import messagebox
import re
import time
from openpyxl import Workbook

from tkinter import *
from tkinter.filedialog import asksaveasfilename

root = Tk()

root.title('Naver Band Comment Count Crawler')

url = StringVar()

lbl = Label(root, text="URL")
lbl.grid(sticky="W", row=0, column=0)
txt = Entry(root, textvariable=url, width=60)
txt.grid(sticky="W", row=0, column=1)

def enter(event):
    startCommentCount()

def saveFile(driver, wb):
    # Ask for the output file name
    file = asksaveasfilename(initialfile="Naver_Band_Comment_Count_Result.xlsx")
    if (file):
        try:
            wb.save(file)
        except Exception:
            messagebox.showerror(title='Error', message="Could not save " + file + ".\nIf the file is open, please close it.")
            saveFile(driver, wb)
        else:
            messagebox.showinfo(title='Complete', message=file + " saved!")
            wb.close()
            driver.quit()
    # The user cancelled the dialog
    else:
        if(not messagebox.askyesno("Cancel", "The file will not be saved. Are you sure you want to cancel?")):
            saveFile(driver, wb)
        else:
            wb.close()
            driver.quit()

def startCommentCount():

    if(url.get() == ''):
        messagebox.showinfo(title='Info', message='Please enter a URL.')
        return

    try:
        driver = webdriver.Chrome("chromedriver")
    except Exception:
        messagebox.showerror(title='Error', message='Check that chromedriver is in the same location.')
        return

    try:
        driver.get(url.get())
    except Exception:
        messagebox.showwarning(title='Warning', message='Could not connect to the URL you entered.\nPlease check it again.')
        driver.quit()
        return

    # Wait for the user to log in
    messagebox.showinfo(title='Login', message='After logging in, click OK once the page has appeared.')

    # Scroll down until the page stops loading more content
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Crawl the comment list
    comment_list = driver.find_elements_by_css_selector("p[class='comment']")

    # Build the Excel sheet
    wb = Workbook()
    ws = wb.active

    ws.cell(row=1, column=1).value = "No."
    ws.cell(row=1, column=2).value = "Tagged member"
    ws.cell(row=1, column=3).value = "Comment"
    ws.cell(row=1, column=4).value = "Characters (with spaces)"
    ws.cell(row=1, column=5).value = "Characters (without spaces)"

    i = 0
    row = 2

    for comment in comment_list:

        comment_text_list = []
        member_text_list = []

        # Check whether any members are mentioned
        member_list = comment.find_elements_by_tag_name("strong")

        if (len(member_list) == 0):
            member_text_list.append("")
            comment_text_list.append(comment.text)
        else:
            for member in member_list:
                member_text_list.append(member.text)

            comment_text = comment.get_attribute('innerHTML')

            # Split a comment that mentions several members
            m = 0
            curr = 0
            while (True):
                # Find the next opening strong tag and its distance from the last search position
                start = comment_text.find("<strong>", curr)
                diff = start - curr

                if (start == -1):
                    # No more members: add any trailing comment text, then stop
                    if (curr + 9 != len(comment_text)):
                        comment_text_list.append(comment_text[curr + 9:len(comment_text)].strip())
                    break

                if (curr == 0):
                    # Comment text that comes before the first mention
                    if (comment_text[curr:start].strip() != ''):
                        comment_text_list.append(comment_text[curr:start].strip())
                else:
                    # Members written back to back (length of the closing tag = 9)
                    if (diff == 9):
                        member_text_list[m] = member_text_list[m] + ", " + member_text_list[m + 1]
                        member_text_list.pop(m + 1)
                    # Otherwise it is comment text: extract it without the closing tag
                    else :
                        comment_text_list.append(comment_text[curr + 9:start].strip())
                        m += 1

                # Move the search position to the closing strong tag
                end = comment_text.find("</strong>", start)
                curr = end

        # Write the Excel rows
        j = 1
        m = 0

        for text in comment_text_list:

            # Number the parts of a multi-mention comment (1-1, 1-2, ...)
            child = ''
            if (len(comment_text_list) > 1):
                child = "-" + str(j)
                j += 1

            ws.cell(row=row, column=1).value = str(i + 1) + child
            # Guard against more comment parts than members (e.g. [comment1 member1 comment2])
            ws.cell(row=row, column=2).value = member_text_list[m] if m < len(member_text_list) else ""
            ws.cell(row=row, column=3).value = text
            ws.cell(row=row, column=4).value = len(text)
            ws.cell(row=row, column=5).value = len(re.findall(r"\S", text))

            m += 1
            row += 1

        i += 1

    saveFile(driver, wb)

root.bind('<Return>', enter)

btn = Button(root, text="START", width=15, command=startCommentCount)
btn.grid(row=3, column=1)

root.mainloop()

Sunday, December 23, 2018