[K League Data Analysis] 2. Implementing a Crawler to Collect xG Data

Previous post: [K League ETL Pipeline] 1. Implementing the K League Data Portal Crawler (jeongbeenson19.tistory.com)

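I slightly modified the code from the previous post to write a crawler for the xG-related records found under Data Center > Additional Records (데이터센터 > 부가기록) on the K League data portal.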

I consider xG (expected goals) one of the key metrics for evaluating forwards in modern football, so I expect this data to make more useful visualizations possible.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from bs4 import BeautifulSoup
import time
import csv
import os

DELAY = 3  # seconds to wait between page transitions
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
# The portal URL is read from the K_LEAGUE_DATA_PORTAL environment variable;
# a missing variable raises KeyError here instead of failing inside driver.get().
driver.get(os.environ["K_LEAGUE_DATA_PORTAL"])


def xg_crawler(round_number):
    # Column names are fixed here rather than taken from the crawled table.
    # They are defined before the try block so they exist even if crawling fails.
    xg_data = ["순위,선수명,구단,출전수,출전시간(분),슈팅,득점,xG,득점/xG,90분당 xG"]
    xg_columns = xg_data[0].split(",")
    try:
        # Open the Data Center section of the K League data portal
        print("Connecting to data center...")
        time.sleep(DELAY)
        driver.execute_script("moveMainFrame('0011');")

        print("Connecting to Additional record...")
        time.sleep(DELAY)
        driver.execute_script("moveMainFrame('0194');")  # Additional Records (부가기록)
        driver.execute_script("setDisplayMenu('subMenuLayer', '0432')")  # Expected Goals (기대득점)
        driver.execute_script("moveMainFrame('0433');")
        time.sleep(DELAY)  # let the xG page finish loading before reading its HTML
        html = driver.page_source
        xg_soup = BeautifulSoup(html, "html.parser")
        xg_table = xg_soup.find("table")

        if xg_table:  # process only when a table was found
            xg_rows = xg_table.find_all('tr')
            for row in xg_rows[1:]:  # skip the portal's own header row
                cols = row.find_all(['td', 'th'])  # collect both 'td' and 'th' cells
                cols = [ele.text.strip() for ele in cols]  # extract text and trim whitespace
                xg_data.append(cols)
        else:
            print("No data found")
    except NoSuchElementException as e:
        print(f"No such element: {e}")
    except TimeoutException as e:
        print(f"Timeout occurred: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # The module-level driver is closed here, so this function runs once per process
        driver.quit()
    # Save the result as a CSV file
    xg_output_file = f"data/{round_number}-round-xg.csv"
    os.makedirs("data", exist_ok=True)  # make sure the output directory exists
    with open(xg_output_file, mode='w', newline='', encoding='utf-8-sig') as file:
        writer = csv.writer(file)
        writer.writerow(xg_columns)  # write the column names
        writer.writerows(xg_data[1:])  # write the data rows
    print(f"Data has been written to \n{xg_output_file}")
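The CSV-writing step at the end of the crawler could also be pulled out into a small helper so it can be checked without launching a browser. The following is only a sketch: `write_rows_csv` is a hypothetical name that does not appear in the project, the rows are placeholders rather than real portal data, and dropping rows whose width differs from the header is my own assumption about how malformed rows should be handled.

```python
import csv
import os
import tempfile


def write_rows_csv(path, columns, rows):
    """Write a header plus data rows; return the number of data rows written."""
    # Assumption: rows whose width differs from the header are not real data rows.
    valid_rows = [row for row in rows if len(row) == len(columns)]
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    # utf-8-sig prepends a byte order mark, which helps spreadsheet tools
    # detect UTF-8 when the file contains Korean text.
    with open(path, mode="w", newline="", encoding="utf-8-sig") as file:
        writer = csv.writer(file)
        writer.writerow(columns)
        writer.writerows(valid_rows)
    return len(valid_rows)


# Placeholder rows for illustration only, not real portal data.
columns = ["순위", "선수명", "xG"]
rows = [["1", "Player A", "0.0"], ["(single-cell filler row)"]]
out_path = os.path.join(tempfile.mkdtemp(), "data", "0-round-xg.csv")
written = write_rows_csv(out_path, columns, rows)
print(f"{written} row(s) written to {out_path}")
```

Returning the row count makes it easy for the caller to notice a crawl that produced a header-only file.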

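One notable quirk I ran into while writing the code: some columns hold the same data in both the dataset crawled in the previous post and the xG dataset, yet the data portal labels them with different column names.

In both crawlers, I put the column names in as the first element when initializing the data list, and excluded the crawled column names (the table's own header row) with list slicing.

As a result, the column names can be aligned just by editing the list in the initialization step, without touching the crawling logic, which should make it easier to merge the two datasets in the Transform stage.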
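To make the idea concrete, here is a minimal sketch of overriding the crawled header and then joining two tables on their shared columns. Everything in it is an assumption for illustration: the helper names `with_fixed_header` and `merge_on`, the English header labels, and the player rows are made up, and the actual Transform stage may well be implemented differently.

```python
def with_fixed_header(columns, crawled_rows):
    """Replace the crawled header row with our own column names."""
    # crawled_rows[0] is the page's header; slicing from 1 drops it.
    return [columns] + crawled_rows[1:]


def merge_on(keys, left, right):
    """Inner-join two header-first tables on the given key columns."""
    l_cols, r_cols = left[0], right[0]
    l_idx = [l_cols.index(k) for k in keys]
    r_idx = [r_cols.index(k) for k in keys]
    # Non-key columns of the right table are appended to the left table.
    extra = [i for i in range(len(r_cols)) if i not in r_idx]
    lookup = {tuple(row[i] for i in r_idx): row for row in right[1:]}
    merged = [l_cols + [r_cols[i] for i in extra]]
    for row in left[1:]:
        match = lookup.get(tuple(row[i] for i in l_idx))
        if match is not None:
            merged.append(row + [match[i] for i in extra])
    return merged


# Placeholder tables: same fields, labelled differently on each page.
crawled_basic = [
    ["Name", "Team", "Goals"],
    ["Player A", "Team X", "3"],
    ["Player B", "Team Y", "1"],
]
crawled_xg = [
    ["Player", "Club", "xG"],
    ["Player B", "Team Y", "0.8"],
    ["Player A", "Team X", "2.5"],
]

basic = with_fixed_header(["선수명", "구단", "득점"], crawled_basic)
xg = with_fixed_header(["선수명", "구단", "xG"], crawled_xg)
merged = merge_on(["선수명", "구단"], basic, xg)
for row in merged:
    print(row)
```

Because both tables now share the key names 선수명 and 구단, the join needs no per-page renaming logic.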

TODO:

Preprocess the collected data.

Carry out feature engineering, using a variety of existing visualizations as reference.

The full source code is available here:

GitHub - jeongbeenson19/K-league-pipeline-project


github.com
