Python_matplotlib_04 (Pandas Visualization)

lsc99 2023. 9. 22. 16:37

1. Pandas Visualization

- pandas has built-in plotting support based on matplotlib.
- Call the plot() method or the plot accessor on a Series or DataFrame.
- https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

1.1. plot()
The value passed to the kind parameter selects which type of graph is drawn.
kind : graph type
- 'line' : line plot (default)
- 'bar' : vertical bar plot
- 'barh' : horizontal bar plot
- 'hist' : histogram
- 'box' : boxplot
- 'kde' : Kernel Density Estimation plot
- 'pie' : pie plot
- 'scatter' : scatter plot

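Create the Series to use

import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series([100, 20, 70, 90, 150], index=['사과', '귤', '배', '복숭아', '딸기']) # index: apple, tangerine, pear, peach, strawberry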

Drawing the graph directly with plt (the usual matplotlib way)

# using plt directly
plt.bar(s.index, s)
plt.show()

Drawing the graph with the Series plot() method

- This works because pandas wraps matplotlib's plotting functions.

s.plot(kind = 'bar') # the index supplies the labels and the values supply the data
plt.title('Title')
plt.show()

Drawing the graph with the Series plot accessor

s.plot.bar() # plot accessor
plt.show()
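
As a quick check that the two spellings are interchangeable, the sketch below draws one Series with plot(kind='bar') and with plot.bar() on two Axes and compares the bar heights. The fruit values are made up, and the Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# made-up sample data (not from the post)
s = pd.Series([100, 20, 70], index=["apple", "pear", "peach"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
s.plot(kind="bar", ax=ax1, title="plot(kind='bar')")  # kind keyword
s.plot.bar(ax=ax2, title="plot.bar()")                # plot accessor

# each bar is a Rectangle patch; collect the heights drawn on both Axes
heights_kind = [float(p.get_height()) for p in ax1.patches]
heights_accessor = [float(p.get_height()) for p in ax2.patches]
print(heights_kind == heights_accessor)
```

Which spelling to use is a matter of taste; the accessor form tends to work better with editor auto-completion.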

1.2. Bar graph
The index is used as the category axis: each index label tells which value a bar represents.

Read the CSV file -> tips

tips = pd.read_csv('data/tips.csv')
tips.shape

Count how many times each value occurs in the 'sex' column

tips['sex'].value_counts()

Plot the counts with .plot(kind = 'bar')

plt.figure(figsize=(10, 10))
tips['sex'].value_counts().plot(kind = 'bar')
plt.title('Row count by sex')
plt.ylabel('Table count')
plt.xlabel('Sex')
plt.show()

These settings can also be passed directly to plot().

plt.figure(figsize=(10, 10))
tips['sex'].value_counts().plot(kind = 'bar'
                                ,title = 'Table count by sex'
                                ,ylabel = 'table count'
                                ,xlabel = 'Sex'
                                ,rot = 0 # rot : rotation angle of the tick labels
                                ,color = ['blue', '#ff23ad']
                               )
plt.show()
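
plot() returns the matplotlib Axes it drew on, so keyword arguments and Axes methods can be mixed freely. The sketch below is only an illustration: the five-row `sex` Series is made up in place of tips.csv, and the Agg backend is assumed so that no window is required.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import pandas as pd

# made-up stand-in for tips['sex']
sex = pd.Series(["Male", "Female", "Male", "Male", "Female"])
counts = sex.value_counts()

# title, ylabel and rot are set through plot() keywords ...
ax = counts.plot(kind="bar", title="Table count by sex", ylabel="table count", rot=0)
# ... and the returned Axes can still be adjusted afterwards
ax.set_xlabel("sex")

print(ax.get_title(), "|", ax.get_xlabel(), "|", ax.get_ylabel())
```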

Horizontal bar graph

# rot : rotates the tick labels
tips['smoker'].value_counts().plot(kind='barh' # horizontal bar graph
                                   ,rot=45
                                  )
plt.show()

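Plotting by two categories
- When there are several columns, a group of side-by-side bars (one bar per column) is drawn for each index label.

# table count by smoking status and sex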
count_df = tips.pivot_table(index='smoker'
                            ,columns='sex'
                            ,values = 'tip'
                            ,aggfunc = 'count'
                           )
count_df

# index : labels
# with two or more columns -> one bar is drawn per column
count_df.plot(kind = 'bar'
             ,rot = 0
             )
plt.show()
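
To make the relationship between the pivot table and the bars concrete, here is a small sketch on made-up rows (a hypothetical stand-in for tips.csv, with the Agg backend assumed). Each cell of the pivot table becomes one bar, so a 2 x 2 table yields four bars.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import pandas as pd

# made-up stand-in for tips.csv
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "sex":    ["Male", "Female", "Male", "Female", "Male", "Male"],
    "tip":    [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
})

# rows = smoker, columns = sex, cells = number of tips recorded
count_df = df.pivot_table(index="smoker", columns="sex", values="tip", aggfunc="count")
print(count_df)

ax = count_df.plot(kind="bar", rot=0)
n_bars = len(ax.patches)  # one Rectangle per (smoker, sex) cell
print(n_bars)
```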

With index = day

# number of guests by day and smoking status (index = day)
tips.pivot_table(index = 'day', columns = 'smoker', values = 'size', aggfunc = 'sum').plot(kind = 'bar');

With index = smoker

# when the index is smoker
tips.pivot_table(index = 'smoker', columns = 'day', values = 'size', aggfunc = 'sum').plot(kind = 'bar')
plt.legend(bbox_to_anchor=(1,1), loc = 'upper left', title = 'Day')
plt.show()

stacked = True -> stacks the columns into a single bar per index label, so the total and each column's share of it are visible at once

# when the index is smoker, stacked = True -> each day's share of the total can be compared
tips.pivot_table(index = 'smoker', columns = 'day', values = 'size', aggfunc = 'sum').plot(kind = 'bar', stacked = True)
plt.legend(bbox_to_anchor=(1,1), loc = 'upper left', title = 'Day')
plt.show()
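
A stacked bar keeps the raw sums, so bars for different index labels usually have different heights. If true proportions are wanted, one option (not used in the post) is to divide each row by its total before plotting. A minimal sketch with made-up guest counts:

```python
import pandas as pd

# made-up guest counts: rows = smoker, columns = day
guests = pd.DataFrame({"Sat": [30, 10], "Sun": [50, 10]}, index=["No", "Yes"])

totals = guests.sum(axis=1)          # the bar heights stacked=True would show
ratios = guests.div(totals, axis=0)  # divide each row by its total -> rows sum to 1

print(totals)
print(ratios)
# ratios.plot(kind='bar', stacked=True) would then draw bars of equal height 1
```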

Using groupby -> mean bill per day

tips.groupby('day')['total_bill'].mean().sort_values(ascending = False).plot(kind = 'bar', title = 'Mean bill per day')

1.3. Pie chart

tips['day'].value_counts(normalize=True) # print the proportions

day_cnt = tips['day'].value_counts()
day_cnt

day_cnt.plot(kind = 'pie'
             ,autopct = '%.2f%%'
             ,fontsize = 10
             ,explode = [0, 0, 0.1, 0]
             ,shadow = True
            )
plt.show()

day_cnt.plot.pie(autopct = '%.2f%%'
             ,fontsize = 10
             ,explode = [0, 0, 0.1, 0]
             ,shadow = True
            )
plt.show()
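
The percentages that autopct prints on the slices are each count divided by the total. The sketch below reproduces them by hand on a made-up `days` Series (a hypothetical stand-in for tips['day']) and compares the result with value_counts(normalize=True).

```python
import pandas as pd

# made-up stand-in for tips['day']
days = pd.Series(["Sat", "Sat", "Sun", "Sat", "Thur", "Sun", "Sat", "Fri"])

day_cnt = days.value_counts()
pct_manual = day_cnt / day_cnt.sum() * 100             # share of each slice in percent
pct_builtin = days.value_counts(normalize=True) * 100  # same thing in one call

# same formatting idea as autopct='%.2f%%'
labels = [f"{p:.2f}%" for p in pct_manual]
print(labels)
```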

1.4. Histogram, KDE (density plot)

tips['total_bill'].plot(kind='hist'
                        # , bins = 30 # split the range into 30 bins
                        ,bins = [0, 10, 20, 25, 30, 50] # set the bin edges explicitly
                       )

# DataFrame -> one histogram per column
tips[['total_bill', 'tip']].plot.hist(bins = 20, alpha = 0.5)
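
To see what a list of bin edges does, the counts can be computed without drawing anything. This sketch uses numpy.histogram, which as far as I know is the same binning that matplotlib's hist relies on; the bill amounts are made up.

```python
import numpy as np

bills = [5, 12, 18, 22, 27, 45, 8, 31]  # made-up bill amounts
edges = [0, 10, 20, 25, 30, 50]         # same edges as the bins list above

counts, used_edges = np.histogram(bills, bins=edges)

# every bin is half-open [left, right) except the last one, which includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {n}")
```

Unequal bin widths make the bar heights harder to compare, so explicit edges are mostly useful when the cut points carry meaning of their own.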

The KDE computation requires scipy, so install it first

pip install scipy

Kernel density estimation

# kernel density estimation plot
tips['total_bill'].plot(kind = 'kde');

With two columns, a separate curve is drawn for each column.

tips[['total_bill', 'tip']].plot.kde()

1.5. Boxplot

tips['total_bill'].plot(kind='box'
                        ,whis = 1 # whiskers reach at most 1 x IQR beyond the box (default is 1.5)
                       )
plt.show()
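
The whis value decides which points count as outliers: anything farther than whis x IQR beyond the quartiles is drawn as an individual point. A minimal sketch of that rule on made-up numbers (the helper name `whisker_limits` is hypothetical, not part of pandas or matplotlib):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 20])  # made-up data

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1  # interquartile range = height of the box


def whisker_limits(whis):
    """Return the (lower, upper) limits beyond which points are outliers."""
    return q1 - whis * iqr, q3 + whis * iqr


low, high = whisker_limits(1)
outliers = values[(values < low) | (values > high)].tolist()
print(low, high, outliers)
```

The whiskers themselves stop at the most extreme data points that still fall inside these limits, so a smaller whis gives shorter whiskers and possibly more outliers.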

With two columns, a separate box is drawn for each column.

tips[['total_bill', 'tip']].plot.box()

1.6. Scatter plot

Check the joint distribution of total_bill and tip

# DataFrame.plot(kind = 'scatter', x = 'column name', y = 'column name')
tips.plot(kind = 'scatter', x = 'total_bill', y = 'tip')

Check the correlation coefficient between total_bill and tip -> corr()

tips[['total_bill', 'tip']].corr()
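
By default corr() computes the Pearson correlation coefficient. The sketch below recomputes it by hand on four made-up rows, to show what the number in the correlation matrix means.

```python
import numpy as np
import pandas as pd

# made-up rows standing in for tips[['total_bill', 'tip']]
df = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0, 40.0],
                   "tip":        [1.0, 3.0, 4.0, 6.0]})

r_pandas = df["total_bill"].corr(df["tip"])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with deviations from the means
dx = df["total_bill"] - df["total_bill"].mean()
dy = df["tip"] - df["tip"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

A value close to 1 means the points in the scatter plot lie near a rising straight line.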

1.7. Python date/time types - the datetime module

from datetime import datetime # date and time combined
from datetime import date # date only
from datetime import time # time only

# create the objects
d = date(year = 2023
         ,month = 9
         ,day = 10
        )
t = time(hour = 21
         ,minute = 23
         ,second = 20
        )
dt = datetime(year = 2023
             ,month = 9
             ,day = 10
             ,hour = 21
             ,minute = 23
             ,second = 20
             )
print(d)
print(t)
print(dt)

# create from the moment the code runs
today = date.today()
ct = datetime.now()
print(today)
print(ct)

Attributes of today

# extract year, month and day individually
today.year, today.month, today.day

Attributes of ct

# extract year, month, day, hour, minute and second individually
ct.year, ct.month, ct.day, ct.hour, ct.minute, ct.second

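strftime() -> converts a date, time, or datetime into a string in the desired format

# date, time, datetime -> string in the desired format
# strftime() method
s = ct.strftime('%Y년 %m월 %d일 %H시 %M분 %S초') # 년/월/일/시/분/초 are literal text: year/month/day/hour/minute/second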
print(type(s))
s

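strptime() -> converts a date/time-formatted string into a datetime

# convert a date/time-formatted string into a datetime
# strptime() takes the string and the format it is written in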
dt = datetime.strptime('2023년 09월 22일 15시 09분 14초', '%Y년 %m월 %d일 %H시 %M분 %S초')
print(type(dt))
dt
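
strftime() and strptime() are inverses as long as both use the same format string. A minimal round-trip sketch (the ISO-like format below is just an example choice, not the one used in the post):

```python
from datetime import datetime

fmt = "%Y-%m-%d %H:%M:%S"  # example format, chosen for illustration
original = datetime(2023, 9, 22, 15, 9, 14)

text = original.strftime(fmt)            # datetime -> str
restored = datetime.strptime(text, fmt)  # str -> datetime

print(text)
print(restored == original)
```

If the string does not match the format, strptime() raises ValueError, so the format has to describe the text exactly, including any literal characters.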

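Date arithmetic

- timedelta(unit = value), where the unit is one or more of weeks, days, hours, minutes, seconds, milliseconds, microseconds (1/1,000,000 of a second)

# compute the date/time obtained by adding a duration to, or subtracting it from, a date/time
# datetime.timedelta : the duration to add or subtract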
from datetime import timedelta

now = datetime.now()
# 3 days (timedelta) later (+)
now + timedelta(days=3)
# 3 days (timedelta) earlier (-)
now - timedelta(days=3)

# weeks, days, hours, minutes, seconds, microseconds (1/1,000,000 of a second)
# the date 3 weeks from now
print(now + timedelta(weeks = 3))
print(now + timedelta(days = 3, hours = 10))
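
The arithmetic also works in the other direction: subtracting two datetime values gives a timedelta. The sketch below uses fixed dates instead of now() so that the results are reproducible; the dates themselves are arbitrary.

```python
from datetime import datetime, timedelta

start = datetime(2023, 9, 22, 15, 0, 0)  # arbitrary fixed starting point

three_weeks_later = start + timedelta(weeks=3)
gap = datetime(2023, 12, 25) - start     # datetime - datetime -> timedelta

print(three_weeks_later)       # 2023-10-13 15:00:00
print(gap.days, gap.seconds)   # whole days, plus the leftover seconds
print(timedelta(days=3, hours=10).total_seconds())
```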

pandas - Timestamp
A date-time type that stores the date and time with up to nanosecond (1/1,000,000,000 of a second) precision.

import pandas as pd

pd.Timestamp(year = 2010, month = 12, day = 21)
pd.Timestamp(year = 2010, month = 12, day = 21, hour = 11, minute = 30)
pd.Timestamp(year = 2010, month = 12, day = 21, hour = 11, minute = 30, second = 22)

Date and time strings

# date parts are separated by '-', time parts by ':'
pd.Timestamp('2000-10-22')
pd.Timestamp('2000-10-22 10:22:33')

Convert strings to the datetime type

s = pd.Series(['2000-10-10', '2000-10-12', '2000-11-10', '2010-10-11', '2004-11-5'])
print(s.dtype)
# Series of date-formatted strings -> Series of Timestamp values (a datetime64 dtype)
s2 = pd.to_datetime(s)
s2

accessor : dt -> provides the attributes/methods that are available when a Series holds Timestamp values

print(s2.dt.year)
print(s2.dt.month)
s2.dt.day
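
A common use of the dt accessor is to split a date column into parts that can then be grouped or filtered. A minimal sketch on three made-up date strings:

```python
import pandas as pd

s = pd.Series(["2000-10-10", "2000-11-10", "2010-10-11"])  # made-up date strings
s2 = pd.to_datetime(s)  # object dtype -> datetime64 dtype

years = s2.dt.year.tolist()
months = s2.dt.month.tolist()

# the parts can be used like any other column, e.g. to filter rows
in_october = s2[s2.dt.month == 10]
print(years, months, len(in_october))
```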

Compute dates a given period later or earlier

Compute the date 20 days later for every element at once

s2 + pd.Timedelta(days=20)

Compute the date 5 weeks, 10 days and 10 hours earlier for every element at once

s2 - pd.Timedelta(weeks = 5, days = 10, hours = 10)
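
The Timedelta is applied to every element, and subtracting two datetime Series gives the elapsed time back. A small sketch on two made-up dates, with the expected results checked by hand:

```python
import pandas as pd

s2 = pd.to_datetime(pd.Series(["2000-10-10", "2000-11-10"]))  # made-up dates

later = s2 + pd.Timedelta(days=20)
earlier = s2 - pd.Timedelta(weeks=5, days=10, hours=10)  # 45 days and 10 hours back

# datetime Series - datetime Series -> timedelta Series
elapsed_days = (later - s2).dt.days.tolist()

print(later.tolist())
print(earlier.tolist())
print(elapsed_days)
```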

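Generate a given number of dates/times spaced at a fixed interval

- date_range() -> often used when building an index.

# freq -> interval, written as integer + period code. ex) 5M : every 5 months, -4Y : every 4 years, going backwards
# period codes : Y, M, D, H, T (minute), S (pandas 2.2 and later prefer YE, ME, h, min, s for Y, M, H, T, S)
# a period code followed by S, ex) YS, MS : the start of each period. Without the S : the end of each period
# periods -> number of values to generate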
pd.date_range('2023-1-1', freq = 'MS', periods = 5)

pd.date_range('2023-1-1', freq = 'Y', periods = 5)

pd.date_range('2023-1-1', freq = '-5YS', periods = 5)
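
The effect of the multiplier and of the sign in freq is easiest to see side by side. This sketch only uses the start-of-period codes MS and YS; the helper `as_text` is hypothetical and just formats the result for printing.

```python
import pandas as pd


def as_text(idx):
    """Format a DatetimeIndex as a list of YYYY-MM-DD strings."""
    return idx.strftime("%Y-%m-%d").tolist()


month_starts = as_text(pd.date_range("2023-1-1", freq="MS", periods=3))
every_2_months = as_text(pd.date_range("2023-1-1", freq="2MS", periods=3))
backwards = as_text(pd.date_range("2023-1-1", freq="-5YS", periods=3))

print(month_starts)    # first day of three consecutive months
print(every_2_months)  # first day of every second month
print(backwards)       # first day of the year, stepping 5 years back each time
```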

Create a new DataFrame and check its values

import numpy as np

index = pd.date_range('2023-1-1', freq = 'D', periods = 10)
value = np.random.randint(1, 100, size = (10, 3))
df = pd.DataFrame(value, columns = ['NO1', 'NO2', 'NO3'], index = index)
df

Read an Excel file into a DataFrame and reshape it (seasons as columns, years as the index)

df = pd.read_excel('data/강수량.xlsx')
df = df.set_index('계절').T
df.rename_axis('년도', axis = 0, inplace = True)
df

Convert the index to dates (format)

df.index = pd.to_datetime(df.index, format = '%Y')
df

Check the trend with graphs

# check the trend with a line plot
df.loc['2009-01-01'].plot(kind = 'line') # kind = 'line' is the default, so it can be omitted

df['봄'].plot(figsize = (10, 5))

Graph of two columns

df[['봄', '겨울']].plot(figsize = (10, 4))

Graph of every season

df.plot(figsize = (10, 4))
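
Since the Excel file is not included here, the sketch below rebuilds the same shape of table from made-up numbers (the column names `spring` and `winter` and all values are hypothetical). It shows what the row selection and the column selection used above hand to plot().

```python
import pandas as pd

# made-up rainfall table: rows = years, columns = seasons
df = pd.DataFrame(
    {"spring": [120, 95, 140], "winter": [60, 80, 55]},
    index=["2009", "2010", "2011"],
)
df.index = pd.to_datetime(df.index, format="%Y")  # '2009' -> 2009-01-01

row_2009 = df.loc["2009-01-01"]  # one year: a value per season
spring = df["spring"]            # one season: a value per year

print(row_2009)
print(spring)
# row_2009.plot() draws the seasons of 2009; spring.plot() draws spring over the years
```

With a DatetimeIndex on the x-axis, pandas formats the tick labels as dates, which is why the index is converted before plotting.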