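What is EDA?
EDA (exploratory data analysis) is the process of looking at collected data from several angles to understand it and identify its features. Instead of judging the data subjectively, you use tools such as graphs and statistics so that the data can be read at a glance.
Why do EDA?
Raw data is usually messy. Missing values (NaN) and data types have to be handled during preprocessing. EDA also means finding out what each column name means and checking the distribution and values of each column, which helps you understand how the data is represented. The correlations between columns can also be used to derive new features.
By forming hypotheses and checking them with graphs or statistics, you can identify the patterns in the data.
EDA practice
I took a sample dataset from Kaggle and worked in a notebook instance on Amazon SageMaker. I plan to study the SageMaker services further and write them up in a later post.
Data source: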
https://www.kaggle.com/datasets/d44dbbc86cc4cf29d8184335785adb6b1dc6398af1d746825f63145a3bbd8b49
import os
import boto3  # AWS SDK for Python
from botocore.exceptions import ClientError


def download_from_s3(url):
    """Download s3://bucket/key into the working directory unless the file already exists."""
    url_parts = url.split("/")  # => ['s3:', '', 'sagemakerbucketname', 'data', ...
    bucket_name = url_parts[2]
    key = "/".join(url_parts[3:])  # S3 keys always use '/', whatever the local OS separator is
    filename = url_parts[-1]
    if not os.path.exists(filename):
        try:
            # Create an S3 resource
            s3 = boto3.resource('s3')
            print('Downloading {} to {}'.format(url, filename))
            s3.Bucket(bucket_name).download_file(key, filename)
        except ClientError as e:
            if e.response['Error']['Code'] == "404":
                print('The object {} does not exist in bucket {}'.format(
                    key, bucket_name))
            else:
                raise


download_from_s3("s3://---/LSTM-Multivariate_pollution.csv")
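The URL handling in download_from_s3 relies on splitting the string on "/". As a minimal sketch of the same step, the standard library's urlparse can do the splitting. The helper name parse_s3_url and the bucket name example-bucket are hypothetical and are not part of the original notebook.

```python
from urllib.parse import urlparse


def parse_s3_url(url):
    """Split an s3://bucket/key URL into (bucket, key, filename)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected a URL of the form s3://bucket/key")
    key = parsed.path.lstrip("/")      # object key inside the bucket
    filename = key.rsplit("/", 1)[-1]  # last path component
    return parsed.netloc, key, filename


# 'example-bucket' is a placeholder, not the bucket used in this post
print(parse_s3_url("s3://example-bucket/data/LSTM-Multivariate_pollution.csv"))
```

Compared with indexing into url.split("/"), this rejects a URL that is not an s3:// URL instead of silently using the wrong list element.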
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
file = pd.read_csv("LSTM-Multivariate_pollution.csv")
Column descriptions
- date: date
- pollution: air pollution
- dew: dew point
- temp: temperature
- press: pressure
- wnd_dir: wind direction
- wnd_spd: wind speed
- snow, rain: cumulative number of hours of snow and rain
df = pd.DataFrame(file)
df.head()
| | date | pollution | dew | temp | press | wnd_dir | wnd_spd | snow | rain |
---|---|---|---|---|---|---|---|---|---|
0 | 2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | SE | 1.79 | 0 | 0 |
1 | 2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | SE | 2.68 | 0 | 0 |
2 | 2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | SE | 3.57 | 0 | 0 |
3 | 2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | SE | 5.36 | 1 | 0 |
4 | 2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | SE | 6.25 | 2 | 0 |
df.shape
(43800, 9)
df.describe()
| | pollution | dew | temp | press | wnd_spd | snow | rain |
---|---|---|---|---|---|---|---|
count | 43800.000000 | 43800.000000 | 43800.000000 | 43800.000000 | 43800.000000 | 43800.000000 | 43800.000000 |
mean | 94.013516 | 1.828516 | 12.459041 | 1016.447306 | 23.894307 | 0.052763 | 0.195023 |
std | 92.252276 | 14.429326 | 12.193384 | 10.271411 | 50.022729 | 0.760582 | 1.416247 |
min | 0.000000 | -40.000000 | -19.000000 | 991.000000 | 0.450000 | 0.000000 | 0.000000 |
25% | 24.000000 | -10.000000 | 2.000000 | 1008.000000 | 1.790000 | 0.000000 | 0.000000 |
50% | 68.000000 | 2.000000 | 14.000000 | 1016.000000 | 5.370000 | 0.000000 | 0.000000 |
75% | 132.250000 | 15.000000 | 23.000000 | 1025.000000 | 21.910000 | 0.000000 | 0.000000 |
max | 994.000000 | 28.000000 | 42.000000 | 1046.000000 | 585.600000 | 27.000000 | 36.000000 |
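Comparing the mean with the median (the 50% row), pollution shows a large gap (94.0 vs. 68.0), and so does wnd_spd (23.9 vs. 5.37). The gap is smaller for temp (12.5 vs. 14.0), and dew is nearly identical (1.83 vs. 2.0).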
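One way to quantify the mean/median comparison, rather than reading it off the describe() table, is to compute the gap and the skewness per column. The following is a minimal sketch on toy data; the helper name mean_median_gap is an assumption and does not appear in the original notebook.

```python
import pandas as pd


def mean_median_gap(frame):
    """Return mean, median, skewness and mean-minus-median per numeric column."""
    numeric = frame.select_dtypes(include="number")
    summary = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),
    })
    summary["gap"] = summary["mean"] - summary["median"]
    # put the columns with the largest absolute gap first
    order = summary["gap"].abs().sort_values(ascending=False).index
    return summary.loc[order]


# toy data, not the pollution dataset
toy = pd.DataFrame({"symmetric": [1, 2, 3, 4, 5],
                    "right_skewed": [1, 1, 2, 2, 50]})
result = mean_median_gap(toy)
print(result)
```

A mean well above the median together with a positive skew points to a long right tail, which is the kind of column where outlier handling is worth a closer look.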
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43800 entries, 0 to 43799
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 43800 non-null object
1 pollution 43800 non-null float64
2 dew 43800 non-null int64
3 temp 43800 non-null float64
4 press 43800 non-null float64
5 wnd_dir 43800 non-null object
6 wnd_spd 43800 non-null float64
7 snow 43800 non-null int64
8 rain 43800 non-null int64
dtypes: float64(4), int64(3), object(2)
memory usage: 3.0+ MB
# Check for missing values
df.isnull().values.any()
False
missed = pd.DataFrame()
missed['column'] = df.columns
missed['percent'] = [round(100* df[col].isnull().sum() / len(df), 2) for col in df.columns]
missed = missed.sort_values('percent',ascending=False)
print(missed)
column percent
0 date 0.0
1 pollution 0.0
2 dew 0.0
3 temp 0.0
4 press 0.0
5 wnd_dir 0.0
6 wnd_spd 0.0
7 snow 0.0
8 rain 0.0
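The per-column percentage above can also be computed without a list comprehension, because the mean of a boolean mask is the fraction of True values. This is a minimal sketch on toy data; missing_percent is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_percent(frame):
    """Percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).round(2).sort_values(ascending=False)


# toy data with one missing value in column 'a'
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": ["x", "y", "z", "w"]})
print(missing_percent(toy))
```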
df.tail()
| | date | pollution | dew | temp | press | wnd_dir | wnd_spd | snow | rain |
---|---|---|---|---|---|---|---|---|---|
43795 | 2014-12-31 19:00:00 | 8.0 | -23 | -2.0 | 1034.0 | NW | 231.97 | 0 | 0 |
43796 | 2014-12-31 20:00:00 | 10.0 | -22 | -3.0 | 1034.0 | NW | 237.78 | 0 | 0 |
43797 | 2014-12-31 21:00:00 | 10.0 | -22 | -3.0 | 1034.0 | NW | 242.70 | 0 | 0 |
43798 | 2014-12-31 22:00:00 | 8.0 | -22 | -4.0 | 1034.0 | NW | 246.72 | 0 | 0 |
43799 | 2014-12-31 23:00:00 | 12.0 | -21 | -3.0 | 1034.0 | NW | 249.85 | 0 | 0 |
df['wnd_dir']=df['wnd_dir'].astype('string')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43800 entries, 0 to 43799
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 43800 non-null object
1 pollution 43800 non-null float64
2 dew 43800 non-null int64
3 temp 43800 non-null float64
4 press 43800 non-null float64
5 wnd_dir 43800 non-null string
6 wnd_spd 43800 non-null float64
7 snow 43800 non-null int64
8 rain 43800 non-null int64
dtypes: float64(4), int64(3), object(1), string(1)
memory usage: 3.0+ MB
df['wnd_dir'].unique()
<StringArray>
['SE', 'cv', 'NW', 'NE']
Length: 4, dtype: string
def wind_encode(s):
    if s == "SE":
        return 1
    elif s == "NE":
        return 2
    elif s == "NW":
        return 3
    else:
        return 4
df["wind_dir"] = df["wnd_dir"].apply(wind_encode)
df = df.drop(["wnd_dir"], axis=1)
df.head()
| | date | pollution | dew | temp | press | wnd_spd | snow | rain | wind_dir |
---|---|---|---|---|---|---|---|---|---|
0 | 2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 1 |
1 | 2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 1 |
2 | 2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 1 |
3 | 2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 1 |
4 | 2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 1 |
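The same encoding can be written as a dictionary lookup. Since wind direction has no natural order, one-hot columns are a common alternative to integer codes. Both are sketched below on toy data; the one-hot variant is an assumption and is not what this post uses.

```python
import pandas as pd

# same codes as wind_encode; 'cv' is the only other value in this dataset
WIND_CODES = {"SE": 1, "NE": 2, "NW": 3, "cv": 4}

toy = pd.DataFrame({"wnd_dir": ["SE", "cv", "NW", "NE", "SE"]})

# dictionary lookup; unlike wind_encode, an unseen value becomes NaN instead of 4
toy["wind_dir"] = toy["wnd_dir"].map(WIND_CODES)

# alternative: one indicator column per direction
one_hot = pd.get_dummies(toy["wnd_dir"], prefix="wnd")

print(toy)
print(one_hot)
```

Integer codes imply an ordering (SE < NE < NW < cv) that a correlation coefficient will treat as meaningful, so the correlation reported for wind_dir depends on the arbitrary code assignment.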
df['date']= pd.to_datetime(df['date'],format = '%Y-%m-%d %H:%M:%S') # convert the object column to datetime
df.set_index("date", inplace = True) # use it as the index
df
date | pollution | dew | temp | press | wnd_spd | snow | rain | wind_dir |
---|---|---|---|---|---|---|---|---|
2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 1 |
2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 1 |
2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 1 |
2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 1 |
2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-31 19:00:00 | 8.0 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 |
2014-12-31 20:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 |
2014-12-31 21:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 |
2014-12-31 22:00:00 | 8.0 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 |
2014-12-31 23:00:00 | 12.0 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 |
43800 rows × 8 columns
Splitting the datetime into parts so that pollution can be examined by period
df['year'] = pd.DatetimeIndex(df.index).year
df['month'] = pd.DatetimeIndex(df.index).month
df['day'] = pd.DatetimeIndex(df.index).day
df['hour'] = pd.DatetimeIndex(df.index).hour
df
date | pollution | dew | temp | press | wnd_spd | snow | rain | wind_dir | year | month | day | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 1 | 2010 | 1 | 2 | 0 |
2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 1 | 2010 | 1 | 2 | 1 |
2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 1 | 2010 | 1 | 2 | 2 |
2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 1 | 2010 | 1 | 2 | 3 |
2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 1 | 2010 | 1 | 2 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-31 19:00:00 | 8.0 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 | 2014 | 12 | 31 | 19 |
2014-12-31 20:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 | 2014 | 12 | 31 | 20 |
2014-12-31 21:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 | 2014 | 12 | 31 | 21 |
2014-12-31 22:00:00 | 8.0 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 | 2014 | 12 | 31 | 22 |
2014-12-31 23:00:00 | 12.0 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 | 2014 | 12 | 31 | 23 |
43800 rows × 12 columns
df_2010 = df[df['year']==2010]
df_2011 = df[df['year']==2011]
df_2012 = df[df['year']==2012]
df_2013 = df[df['year']==2013]
df_2014 = df[df['year']==2014]
df_2014
date | pollution | dew | temp | press | wnd_spd | snow | rain | wind_dir | year | month | day | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-01-01 00:00:00 | 24.0 | -20 | 7.0 | 1014.0 | 143.48 | 0 | 0 | 3 | 2014 | 1 | 1 | 0 |
2014-01-01 01:00:00 | 53.0 | -20 | 7.0 | 1013.0 | 147.50 | 0 | 0 | 3 | 2014 | 1 | 1 | 1 |
2014-01-01 02:00:00 | 65.0 | -20 | 6.0 | 1013.0 | 151.52 | 0 | 0 | 3 | 2014 | 1 | 1 | 2 |
2014-01-01 03:00:00 | 70.0 | -20 | 6.0 | 1013.0 | 153.31 | 0 | 0 | 3 | 2014 | 1 | 1 | 3 |
2014-01-01 04:00:00 | 79.0 | -18 | 3.0 | 1012.0 | 0.89 | 0 | 0 | 4 | 2014 | 1 | 1 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-31 19:00:00 | 8.0 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 | 2014 | 12 | 31 | 19 |
2014-12-31 20:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 | 2014 | 12 | 31 | 20 |
2014-12-31 21:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 | 2014 | 12 | 31 | 21 |
2014-12-31 22:00:00 | 8.0 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 | 2014 | 12 | 31 | 22 |
2014-12-31 23:00:00 | 12.0 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 | 2014 | 12 | 31 | 23 |
8760 rows × 12 columns
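Instead of one variable per year, the per-year frames can be collected with groupby on the index. The sketch below uses a four-row toy frame; the dictionary name by_year is an assumption.

```python
import pandas as pd

# toy hourly rows that straddle a year boundary
idx = pd.to_datetime(["2010-12-31 22:00", "2010-12-31 23:00",
                      "2011-01-01 00:00", "2011-01-01 01:00"])
toy = pd.DataFrame({"pollution": [10.0, 20.0, 30.0, 50.0]}, index=idx)

# one sub-frame per calendar year, keyed by the year
by_year = {year: part for year, part in toy.groupby(toy.index.year)}

print(sorted(by_year))
print(by_year[2011])
```

With this layout a plotting loop can iterate over by_year.items() and does not need to be edited when the data covers more years.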
Visualizing the pollution trend
sns.set_style('whitegrid')
fig, ax = plt.subplots(nrows=5, figsize=(14, 18))
for axis, yearly in zip(ax, [df_2010, df_2011, df_2012, df_2013, df_2014]):
    sns.lineplot(x=yearly.index, y=yearly['pollution'], label='pollution', ax=axis)
    # The rows are hourly, so window=30 would average 30 hours; '30D' is a 30-day time window
    sns.lineplot(x=yearly.index, y=yearly['pollution'].rolling('30D').mean(), label='pollution 30 Day Avg', ax=axis)
plt.show()
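Because the data has one row per hour, a rolling window given as a row count and one given as a time span cover very different periods. The following sketch on a toy hourly series shows the difference; the series and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# 72 hourly timestamps with values 0, 1, ..., 71
idx = pd.Timestamp("2010-01-01") + pd.to_timedelta(np.arange(72), unit="h")
s = pd.Series(np.arange(72, dtype=float), index=idx)

rows_3 = s.rolling(window=3).mean()  # last 3 rows, i.e. 3 hours here
time_1d = s.rolling("1D").mean()     # all rows in the trailing 24 hours

print(rows_3.iloc[-1])   # mean of 69, 70, 71
print(time_1d.iloc[-1])  # mean of 48 .. 71
```

A row-count window returns NaN until it is full, while a time-based window starts averaging from the first row, so the two curves also differ at the start of each year.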
Computing the annual total of pollution
sns.set_style('whitegrid')
width = 0.5
sns.set(rc={'figure.figsize':(16,8)})
ax = sns.barplot(x="year", y="pollution", data=df, estimator=np.sum)
for bar in ax.patches:
    x = bar.get_x()  # x coordinate of the bar's left edge
    old_width = bar.get_width()  # current bar width
    bar.set_width(width)  # apply the new width
    bar.set_x(x + (old_width - width) / 2)  # shift the left edge so the bar stays centered
# No legend here: the bars carry no labels, so plt.legend() would only emit a warning
plt.show()
fig, ax = plt.subplots(nrows=5)
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(4,1)})
sns.barplot(data = df_2010, x = df_2010['month'], y = df_2010['pollution'],label='pollution',ax=ax[0],estimator=np.sum)
sns.barplot(data = df_2011, x = df_2011['month'], y = df_2011['pollution'],label='pollution',ax=ax[1],estimator=np.sum)
sns.barplot(data = df_2012, x = df_2012['month'], y = df_2012['pollution'],label='pollution',ax=ax[2],estimator=np.sum)
sns.barplot(data = df_2013, x = df_2013['month'], y = df_2013['pollution'],label='pollution',ax=ax[3],estimator=np.sum)
sns.barplot(data = df_2014, x = df_2014['month'], y = df_2014['pollution'],label='pollution',ax=ax[4],estimator=np.sum)
plt.show()
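The monthly totals shown in the bar charts can also be inspected as numbers with a year-by-month pivot table. This is a minimal sketch on toy data; the name monthly_total is an assumption.

```python
import pandas as pd

# toy rows: two in Jan 2010, one in Feb 2010, one in Jan 2011
idx = pd.to_datetime(["2010-01-05", "2010-01-20", "2010-02-01", "2011-01-10"])
toy = pd.DataFrame({"pollution": [10.0, 5.0, 7.0, 3.0]}, index=idx)
toy["year"] = toy.index.year
toy["month"] = toy.index.month

# rows = year, columns = month, cells = summed pollution
monthly_total = toy.pivot_table(index="year", columns="month",
                                values="pollution", aggfunc="sum")
print(monthly_total)
```

Year and month combinations without any rows show up as NaN, which makes gaps in the data visible that a bar chart would simply omit.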
Visualizing the temp & dew trends
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(14,8)})
sns.lineplot(data = df, x = df.index, y = 'temp',label='temp')
sns.lineplot(data = df, x = df.index, y = df['temp'].rolling('30D').mean(),label='temp 30 Day Avg')  # hourly rows: '30D' is a 30-day time window
sns.lineplot(data = df, x = df.index, y = 'dew',label='dew')
sns.lineplot(data = df, x = df.index, y = df['dew'].rolling('30D').mean(),label='dew 30 Day Avg')  # hourly rows: '30D' is a 30-day time window
plt.legend(loc =1,prop={'size':8})
plt.show()
Visualizing press
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(14,8)})
sns.lineplot(data = df, x = df.index, y = 'press',label='press')
sns.lineplot(data = df, x = df.index, y = df['press'].rolling('30D').mean(),label='press 30 Day Avg')  # hourly rows: '30D' is a 30-day time window
plt.legend(loc =1,prop={'size':8})
plt.show()
Visualizing wnd_spd
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(14,8)})
sns.lineplot(data = df, x = df.index, y = 'wnd_spd',label='wnd_s')
sns.lineplot(data = df, x = df.index, y = df['wnd_spd'].rolling('30D').mean(),label='wnd_s 30 Day Avg')  # hourly rows: '30D' is a 30-day time window
plt.legend(loc =1,prop={'size':8})
plt.show()
Visualizing snow and rain
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(14,8)})
sns.lineplot(data = df, x = df.index, y = 'snow',label='snow')
sns.lineplot(data = df, x = df.index, y = 'rain',label='rain')
plt.legend(loc =1,prop={'size':8})
plt.show()
Removing outliers
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(18,8)})
sns.boxplot(data = df, x = df['year'], y = 'pollution',color='red')
plt.show()
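def outliers_iqr(data):
    """Return a boolean mask that is True where a value lies outside 1.5 * IQR of the quartiles."""
    q1, q3 = np.percentile(data,[25,75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return (data > upper_bound) | (data < lower_bound)
# Select the outlier rows explicitly instead of reaching for the global df inside the function
df_outlier_pollution = df[outliers_iqr(df['pollution'])]
a = df_outlier_pollution.index
df.drop(a,inplace=True)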
DatetimeIndex(['2010-01-17 21:00:00', '2010-01-17 23:00:00',
'2010-01-18 02:00:00', '2010-01-18 03:00:00',
'2010-01-18 04:00:00', '2010-01-18 05:00:00',
'2010-01-18 19:00:00', '2010-01-18 20:00:00',
'2010-01-18 21:00:00', '2010-01-18 22:00:00',
...
'2014-12-28 02:00:00', '2014-12-28 03:00:00',
'2014-12-28 04:00:00', '2014-12-28 22:00:00',
'2014-12-28 23:00:00', '2014-12-29 00:00:00',
'2014-12-29 01:00:00', '2014-12-29 02:00:00',
'2014-12-29 03:00:00', '2014-12-29 04:00:00'],
dtype='datetime64[ns]', name='date', length=1878, freq=None)
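To see which values this rule removes, the bounds can be worked out from the quartiles in the describe() output above (25% = 24.0, 75% = 132.25 for pollution). The helper iqr_bounds is a hypothetical name used only for this sketch.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Lower and upper cut-offs of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# quartiles of 'pollution' taken from the describe() table
lower, upper = iqr_bounds(24.0, 132.25)
print(lower, upper)  # -138.375 294.625
```

Since the minimum of pollution is 0, the lower bound never triggers. The 1878 dropped rows are all readings above 294.625, that is, the highest pollution episodes.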
Checking the result after removing the outliers
df
date | pollution | dew | temp | press | wnd_spd | snow | rain | wind_dir | year | month | day | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2010-01-02 00:00:00 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 1 | 2010 | 1 | 2 | 0 |
2010-01-02 01:00:00 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 1 | 2010 | 1 | 2 | 1 |
2010-01-02 02:00:00 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 1 | 2010 | 1 | 2 | 2 |
2010-01-02 03:00:00 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 1 | 2010 | 1 | 2 | 3 |
2010-01-02 04:00:00 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 1 | 2010 | 1 | 2 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-31 19:00:00 | 8.0 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 | 2014 | 12 | 31 | 19 |
2014-12-31 20:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 | 2014 | 12 | 31 | 20 |
2014-12-31 21:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 | 2014 | 12 | 31 | 21 |
2014-12-31 22:00:00 | 8.0 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 | 2014 | 12 | 31 | 22 |
2014-12-31 23:00:00 | 12.0 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 | 2014 | 12 | 31 | 23 |
41922 rows × 12 columns
df_2010 = df[df['year']==2010]
df_2011 = df[df['year']==2011]
df_2012 = df[df['year']==2012]
df_2013 = df[df['year']==2013]
df_2014 = df[df['year']==2014]
df_2014
date | pollution | dew | temp | press | wnd_spd | snow | rain | wind_dir | year | month | day | hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2014-01-01 00:00:00 | 24.0 | -20 | 7.0 | 1014.0 | 143.48 | 0 | 0 | 3 | 2014 | 1 | 1 | 0 |
2014-01-01 01:00:00 | 53.0 | -20 | 7.0 | 1013.0 | 147.50 | 0 | 0 | 3 | 2014 | 1 | 1 | 1 |
2014-01-01 02:00:00 | 65.0 | -20 | 6.0 | 1013.0 | 151.52 | 0 | 0 | 3 | 2014 | 1 | 1 | 2 |
2014-01-01 03:00:00 | 70.0 | -20 | 6.0 | 1013.0 | 153.31 | 0 | 0 | 3 | 2014 | 1 | 1 | 3 |
2014-01-01 04:00:00 | 79.0 | -18 | 3.0 | 1012.0 | 0.89 | 0 | 0 | 4 | 2014 | 1 | 1 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2014-12-31 19:00:00 | 8.0 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 | 2014 | 12 | 31 | 19 |
2014-12-31 20:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 | 2014 | 12 | 31 | 20 |
2014-12-31 21:00:00 | 10.0 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 | 2014 | 12 | 31 | 21 |
2014-12-31 22:00:00 | 8.0 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 | 2014 | 12 | 31 | 22 |
2014-12-31 23:00:00 | 12.0 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 | 2014 | 12 | 31 | 23 |
8284 rows × 12 columns
sns.set_style('whitegrid')
fig, ax = plt.subplots(nrows=5, figsize=(14, 18))
for axis, yearly in zip(ax, [df_2010, df_2011, df_2012, df_2013, df_2014]):
    sns.lineplot(x=yearly.index, y=yearly['pollution'], label='pollution', ax=axis)
    # The rows are hourly, so window=30 would average 30 hours; '30D' is a 30-day time window
    sns.lineplot(x=yearly.index, y=yearly['pollution'].rolling('30D').mean(), label='pollution 30 Day Avg', ax=axis)
plt.show()
fig, ax = plt.subplots(nrows=5)
sns.set_style('whitegrid')
sns.set(rc={'figure.figsize':(4,1)})
sns.barplot(data = df_2010, x = df_2010['month'], y = df_2010['pollution'],label='pollution',ax=ax[0],estimator=np.sum)
sns.barplot(data = df_2011, x = df_2011['month'], y = df_2011['pollution'],label='pollution',ax=ax[1],estimator=np.sum)
sns.barplot(data = df_2012, x = df_2012['month'], y = df_2012['pollution'],label='pollution',ax=ax[2],estimator=np.sum)
sns.barplot(data = df_2013, x = df_2013['month'], y = df_2013['pollution'],label='pollution',ax=ax[3],estimator=np.sum)
sns.barplot(data = df_2014, x = df_2014['month'], y = df_2014['pollution'],label='pollution',ax=ax[4],estimator=np.sum)
plt.show()
Correlation analysis against the pollution data
df1 = df.reset_index(drop=True)
# Pass the labels through the columns keyword; the positional axis argument triggers a FutureWarning
X = df1.drop(columns=['pollution', 'day', 'hour'])
X
| | dew | temp | press | wnd_spd | snow | rain | wind_dir | year | month |
---|---|---|---|---|---|---|---|---|---|
0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 1 | 2010 | 1 |
1 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 1 | 2010 | 1 |
2 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 1 | 2010 | 1 |
3 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 1 | 2010 | 1 |
4 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 1 | 2010 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41917 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 | 2014 | 12 |
41918 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 | 2014 | 12 |
41919 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 | 2014 | 12 |
41920 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 | 2014 | 12 |
41921 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 | 2014 | 12 |
41922 rows × 9 columns
y=df1[["pollution"]]
y
| | pollution |
---|---|
0 | 129.0 |
1 | 148.0 |
2 | 159.0 |
3 | 181.0 |
4 | 138.0 |
... | ... |
41917 | 8.0 |
41918 | 10.0 |
41919 | 10.0 |
41920 | 8.0 |
41921 | 12.0 |
41922 rows × 1 columns
# Create correlation matrix
corr_matrix = X.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.6
to_drop = [column for column in upper.columns if any(upper[column] > 0.6)]
dataplot = sns.heatmap(corr_matrix,mask=np.triu(np.ones(corr_matrix.shape)))
plt.show()
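The list to_drop is computed above but not applied afterwards. As a minimal sketch of how such a list could be used, the toy example below drops the second column of every highly correlated pair. The toy columns and the name reduced are assumptions.

```python
import numpy as np
import pandas as pd

# toy features: 'a_copy' is almost a multiple of 'a', 'b' is unrelated
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "a_copy": [2, 4, 6, 8, 10, 13],
    "b": [1, -1, 1, -1, 1, -1],
})

corr_matrix = toy.corr().abs()
# keep only the entries above the diagonal so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if (upper[column] > 0.6).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Because only the upper triangle is inspected, the column that appears later in the frame is the one dropped from each correlated pair.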
X
| | dew | temp | press | wnd_spd | snow | rain | wind_dir | year | month |
---|---|---|---|---|---|---|---|---|---|
0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 1 | 2010 | 1 |
1 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 1 | 2010 | 1 |
2 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 1 | 2010 | 1 |
3 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 1 | 2010 | 1 |
4 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 1 | 2010 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41917 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 3 | 2014 | 12 |
41918 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 3 | 2014 | 12 |
41919 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 3 | 2014 | 12 |
41920 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 3 | 2014 | 12 |
41921 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 3 | 2014 | 12 |
41922 rows × 9 columns
X_columns=list(X.columns)
y_columns=["pollution"]
y['pollution']
0 129.0
1 148.0
2 159.0
3 181.0
4 138.0
...
41917 8.0
41918 10.0
41919 10.0
41920 8.0
41921 12.0
Name: pollution, Length: 41922, dtype: float64
correlation_result = {}
for i in range(len(X_columns)):
    correlation = X[X_columns[i]].corr(y["pollution"])
    correlation_result[X_columns[i]] = correlation
correlation_result = sorted(correlation_result.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
correlation_result
[('dew', 0.24195768588072772),
('snow', 0.027031686105242845),
('temp', 0.0054214633141136176),
('month', -0.0035802732371824865),
('year', -0.004382839945228096),
('rain', -0.04478088117752061),
('wind_dir', -0.06654092218769357),
('press', -0.12746589207774287),
('wnd_spd', -0.24659006381823134)]
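The list above is sorted by the signed value, so strongly negative correlations such as wnd_spd end up last. If the goal is to rank features by the strength of their linear relationship with pollution, sorting by absolute value is a common alternative. The sketch below reuses the coefficients printed above, rounded to four decimals; the name top_5_by_strength is an assumption, and this ranking is not what the rest of the post uses.

```python
# correlation with pollution, rounded from the output above
correlations = {
    "dew": 0.2420, "snow": 0.0270, "temp": 0.0054,
    "month": -0.0036, "year": -0.0044, "rain": -0.0448,
    "wind_dir": -0.0665, "press": -0.1275, "wnd_spd": -0.2466,
}

# rank by magnitude, ignoring the sign
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
top_5_by_strength = [name for name, _ in ranked[:5]]
print(top_5_by_strength)
```

On these numbers the five strongest features are wnd_spd, dew, press, wind_dir and rain, which differs from the signed top five selected later in this post.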
temp = []
for i in correlation_result:
    temp.append(i[0])  # collect the key (column name)
X_train2 = X[temp]
X_train2
| | dew | snow | temp | month | year | rain | wind_dir | press | wnd_spd |
---|---|---|---|---|---|---|---|---|---|
0 | -16 | 0 | -4.0 | 1 | 2010 | 0 | 1 | 1020.0 | 1.79 |
1 | -15 | 0 | -4.0 | 1 | 2010 | 0 | 1 | 1020.0 | 2.68 |
2 | -11 | 0 | -5.0 | 1 | 2010 | 0 | 1 | 1021.0 | 3.57 |
3 | -7 | 1 | -5.0 | 1 | 2010 | 0 | 1 | 1022.0 | 5.36 |
4 | -7 | 2 | -5.0 | 1 | 2010 | 0 | 1 | 1022.0 | 6.25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41917 | -23 | 0 | -2.0 | 12 | 2014 | 0 | 3 | 1034.0 | 231.97 |
41918 | -22 | 0 | -3.0 | 12 | 2014 | 0 | 3 | 1034.0 | 237.78 |
41919 | -22 | 0 | -3.0 | 12 | 2014 | 0 | 3 | 1034.0 | 242.70 |
41920 | -22 | 0 | -4.0 | 12 | 2014 | 0 | 3 | 1034.0 | 246.72 |
41921 | -21 | 0 | -3.0 | 12 | 2014 | 0 | 3 | 1034.0 | 249.85 |
41922 rows × 9 columns
top_5_features = []
for i in range(5):
    top_5_features.append(correlation_result[i][0])
top_5_features
['dew', 'snow', 'temp', 'month', 'year']
X_train=X[top_5_features]
# ref: https://stackoverflow.com/questions/39409866/correlation-heatmap
# calculate the correlation matrix
corr = X_train.corr()
cmap = sns.diverging_palette(5, 250, as_cmap=True)
def magnify():
    return [dict(selector="th",
                 props=[("font-size", "15pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '300px'),
                        ('font-size', '12pt')])
            ]
# Styler.format(precision=...) replaces the deprecated set_precision
corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format(precision=2)\
    .set_table_styles(magnify())
Hover to magnify
| | dew | snow | temp | month | year |
| --- | --- | --- | --- | --- | --- |
| dew | 1.00 | -0.03 | 0.83 | 0.23 | 0.01 |
| snow | -0.03 | 1.00 | -0.09 | -0.06 | -0.02 |
| temp | 0.83 | -0.09 | 1.00 | 0.16 | 0.06 |
| month | 0.23 | -0.06 | 0.16 | 1.00 | 0.01 |
| year | 0.01 | -0.02 | 0.06 | 0.01 | 1.00 |
X_train=df[top_5_features]
X_train
date | dew | snow | temp | month | year |
---|---|---|---|---|---|
2010-01-02 00:00:00 | -16 | 0 | -4.0 | 1 | 2010 |
2010-01-02 01:00:00 | -15 | 0 | -4.0 | 1 | 2010 |
2010-01-02 02:00:00 | -11 | 0 | -5.0 | 1 | 2010 |
2010-01-02 03:00:00 | -7 | 1 | -5.0 | 1 | 2010 |
2010-01-02 04:00:00 | -7 | 2 | -5.0 | 1 | 2010 |
... | ... | ... | ... | ... | ... |
2014-12-31 19:00:00 | -23 | 0 | -2.0 | 12 | 2014 |
2014-12-31 20:00:00 | -22 | 0 | -3.0 | 12 | 2014 |
2014-12-31 21:00:00 | -22 | 0 | -3.0 | 12 | 2014 |
2014-12-31 22:00:00 | -22 | 0 | -4.0 | 12 | 2014 |
2014-12-31 23:00:00 | -21 | 0 | -3.0 | 12 | 2014 |
41922 rows × 5 columns