Getting Started With Python II

표독's 2016. 4. 7. 11:30

Getting Started with Pandas: Kaggle's Titanic Competition

To recap the last tutorial: we got comfortable using Python to re-implement the models we originally built in Excel. Using a programming language let us (1) use more powerful constructs and methods, such as arrays for storing and retrieving variables, and (2) write scripted steps that can be repeated in the future without doing the work by hand.

However, you may feel that it was easier to understand what was in the data back when you were using Excel. (When you wanted to see a column, you just scrolled over to look at it, rather than counting through the indices 0 to 8.) On the other hand, you might have statistics friends who tell you that life is better with R, which has the concept of a "data.frame". In this third tutorial we take a slight detour from our modeling work to bridge that gap.

Python has another great package called Pandas, which makes data exploration and data cleaning much easier than manipulating arrays. It also lets you write code that is easier to read. Pandas has the concept of a DataFrame, too, which is like a spreadsheet with more programmatic power. Finally, if you go searching for additional tutorials in the forums someday, you will often find that the author uses Pandas.

This tutorial is a little different from the first two: it is not a cohesive script to be run, nor part of a sample .py found on the Data Page. Instead, it is meant to be entered line by line on your Python command line, so that you can learn some of the methods at your disposal and see what happens. You might even deviate from this tutorial and try other variables that interest you. Finally, the output from some commands is very long, so not everything is printed in its entirety here.

Ready? If you have it installed, this is a great time to use ipython or ipython notebook. Otherwise, run python.

Numpy Arrays

Let's review what our train.csv data looked like in Python up to this point. Run the following to load the data again:

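import csv
import numpy as np

# This session runs on Python 2; on Python 3, open the file with
# open(path, newline='') instead of mode 'rb'.
csv_file_object = csv.reader(open(r'E:\kaggle\titanic\train.csv', 'rb'))
header = next(csv_file_object)  # skip the header row (works on Python 2.6+ and 3)
data = []

# Collect every remaining row as a list of strings, then convert to an array
for row in csv_file_object:
    data.append(row)
data = np.array(data)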

Now let's print data:

In[14]: data
Out[13]: 
array([['1', '0', '3', ..., '7.25', '', 'S'],
       ['2', '1', '1', ..., '71.2833', 'C85', 'C'],
       ['3', '1', '3', ..., '7.925', '', 'S'],
       ...,

This is familiar: an array of strings that the csv package was able to read.

Look at the first 15 rows of the Age column: data[0:15,5]

In[15]: data[0:15,5]
Out[14]: 
array(['22', '38', '26', '35', '35', '', '54', '2', '27', '14', '4', '58',
       '20', '39', '14'], 
      dtype='|S82')

Great, that command gives just the ages, and they are still stored as strings. What type of object is this whole column, though?

In[16]: type(data[::,5])
Out[15]: numpy.ndarray

So any slice we take from the data is still a Numpy array. Now let's see if we can take the mean of the passenger ages. They need to be floats instead of strings, so set this up as:

In[17]: ages_onboard = data[0::, 5].astype(np.float)

Traceback (most recent call last):
  File "C:\Users\Jusung\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py", line 3066, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-513c6ab0970f>", line 1, in <module>
    ages_onboard = data[0::, 5].astype(np.float)
ValueError: could not convert string to float:

Hmm. This seemed to work for the first few rows, but then produced an error when numpy got to the missing value '' (an empty string) in the 6th row. There is surely a way to use Python to filter out the missing values, then convert to float, then take the mean, but this isn't sounding easy anymore. So let's try again with Pandas.
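For comparison, the filter-then-convert-then-average route in plain Numpy can be sketched as follows. This is a minimal sketch, not part of the original session: the array ages_as_strings is a stand-in that copies the 15 Age strings printed earlier, and it uses the built-in float rather than np.float, since recent NumPy releases no longer provide that alias.

```python
import numpy as np

# Stand-in for data[0:15, 5]: the first 15 Age strings shown earlier.
ages_as_strings = np.array(['22', '38', '26', '35', '35', '', '54', '2',
                            '27', '14', '4', '58', '20', '39', '14'])

# Boolean mask that is False wherever the age is missing (empty string).
has_age = ages_as_strings != ''

# Keep only the non-empty entries, convert them to floats, then average.
known_ages = ages_as_strings[has_age].astype(float)
mean_age = known_ages.mean()

print(len(known_ages), mean_age)
```

It works, but the missing-value handling has to be written by hand for every column, which is the chore Pandas takes over.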

Pandas DataFrame

The first thing we have to do is import the Pandas package. Pandas has its own functions to read and write a .csv file, so we no longer use the csv package in the commands below. Let's create a new object called 'df' for storing the pandas version of train.csv. (This means you can still refer to the original 'data' numpy array for the rest of this tutorial anytime you want to compare and contrast.)

In[18]: import pandas as pd

In[19]: import numpy as np

In[20]: # For read_csv, always use header=0 when you know the first row is the header row
df = pd.read_csv(r'E:\kaggle\titanic\train.csv', header=0)

In[21]: df

Out[20]: 
     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25            26         1       3   
26            27         0       3   
27            28         0       1   
28            29         1       3   
29            30         0       3   
..           ...       ...     ...   
861          862         0       2   
862          863         1       1   
863          864         0       3   
864          865         0       2   
865          866         1       2   
866          867         1       2   
867          868         0       1   
868          869         0       3   
869          870         1       3   
870          871         0       3   
871          872         1       1   
872          873         0       1   
873          874         0       3   
874          875         1       2   
875          876         1       3   
876          877         0       3   
877          878         0       3   
878          879         0       3   
879          880         1       1   
880          881         1       2   
881          882         0       3   
882          883         0       3   
883          884         0       2   
884          885         0       3   
885          886         0       3   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex  Age  SibSp  \
0                              Braund, Mr. Owen Harris    male   22      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                               Heikkinen, Miss. Laina  female   26      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
4                             Allen, Mr. William Henry    male   35      0   
5                                     Moran, Mr. James    male  NaN      0   
6                              McCarthy, Mr. Timothy J    male   54      0   
7                       Palsson, Master. Gosta Leonard    male    2      3   
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female   27      0   
9                  Nasser, Mrs. Nicholas (Adele Achem)  female   14      1   
10                     Sandstrom, Miss. Marguerite Rut  female    4      1   
11                            Bonnell, Miss. Elizabeth  female   58      0   
12                      Saundercock, Mr. William Henry    male   20      0   
13                         Andersson, Mr. Anders Johan    male   39      1   
14                Vestrom, Miss. Hulda Amanda Adolfina  female   14      0   
15                    Hewlett, Mrs. (Mary D Kingcome)   female   55      0   
16                                Rice, Master. Eugene    male    2      4   
17                        Williams, Mr. Charles Eugene    male  NaN      0   
18   Vander Planke, Mrs. Julius (Emelia Maria Vande...  female   31      1   
19                             Masselmani, Mrs. Fatima  female  NaN      0   
20                                Fynney, Mr. Joseph J    male   35      0   
21                               Beesley, Mr. Lawrence    male   34      0   
22                         McGowan, Miss. Anna "Annie"  female   15      0   
23                        Sloper, Mr. William Thompson    male   28      0   
24                       Palsson, Miss. Torborg Danira  female    8      3   
25   Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...  female   38      1   
26                             Emir, Mr. Farred Chehab    male  NaN      0   
27                      Fortune, Mr. Charles Alexander    male   19      3   
28                       O'Dwyer, Miss. Ellen "Nellie"  female  NaN      0   
29                                 Todoroff, Mr. Lalio    male  NaN      0   
..                                                 ...     ...  ...    ...   
861                        Giles, Mr. Frederick Edward    male   21      1   
862  Swift, Mrs. Frederick Joel (Margaret Welles Ba...  female   48      0   
863                  Sage, Miss. Dorothy Edith "Dolly"  female  NaN      8   
864                             Gill, Mr. John William    male   24      0   
865                           Bystrom, Mrs. (Karolina)  female   42      0   
866                       Duran y More, Miss. Asuncion  female   27      1   
867               Roebling, Mr. Washington Augustus II    male   31      0   
868                        van Melkebeke, Mr. Philemon    male  NaN      0   
869                    Johnson, Master. Harold Theodor    male    4      1   
870                                  Balkic, Mr. Cerin    male   26      0   
871   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)  female   47      1   
872                           Carlsson, Mr. Frans Olof    male   33      0   
873                        Vander Cruyssen, Mr. Victor    male   47      0   
874              Abelson, Mrs. Samuel (Hannah Wizosky)  female   28      1   
875                   Najib, Miss. Adele Kiamie "Jane"  female   15      0   
876                      Gustafsson, Mr. Alfred Ossian    male   20      0   
877                               Petroff, Mr. Nedelio    male   19      0   
878                                 Laleff, Mr. Kristo    male  NaN      0   
879      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female   56      0   
880       Shelley, Mrs. William (Imanita Parrish Hall)  female   25      0   
881                                 Markun, Mr. Johann    male   33      0   
882                       Dahlberg, Miss. Gerda Ulrika  female   22      0   
883                      Banfield, Mr. Frederick James    male   28      0   
884                             Sutehall, Mr. Henry Jr    male   25      0   
885               Rice, Mrs. William (Margaret Norton)  female   39      0   
886                              Montvila, Rev. Juozas    male   27      0   
887                       Graham, Miss. Margaret Edith  female   19      0   
888           Johnston, Miss. Catherine Helen "Carrie"  female  NaN      1   
889                              Behr, Mr. Karl Howell    male   26      0   
890                                Dooley, Mr. Patrick    male   32      0   

     Parch            Ticket      Fare        Cabin Embarked  
0        0         A/5 21171    7.2500          NaN        S  
1        0          PC 17599   71.2833          C85        C  
2        0  STON/O2. 3101282    7.9250          NaN        S  
3        0            113803   53.1000         C123        S  
4        0            373450    8.0500          NaN        S  
5        0            330877    8.4583          NaN        Q  
6        0             17463   51.8625          E46        S  
7        1            349909   21.0750          NaN        S  
8        2            347742   11.1333          NaN        S  
9        0            237736   30.0708          NaN        C  
10       1           PP 9549   16.7000           G6        S  
11       0            113783   26.5500         C103        S  
12       0         A/5. 2151    8.0500          NaN        S  
13       5            347082   31.2750          NaN        S  
14       0            350406    7.8542          NaN        S  
15       0            248706   16.0000          NaN        S  
16       1            382652   29.1250          NaN        Q  
17       0            244373   13.0000          NaN        S  
18       0            345763   18.0000          NaN        S  
19       0              2649    7.2250          NaN        C  
20       0            239865   26.0000          NaN        S  
21       0            248698   13.0000          D56        S  
22       0            330923    8.0292          NaN        Q  
23       0            113788   35.5000           A6        S  
24       1            349909   21.0750          NaN        S  
25       5            347077   31.3875          NaN        S  
26       0              2631    7.2250          NaN        C  
27       2             19950  263.0000  C23 C25 C27        S  
28       0            330959    7.8792          NaN        Q  
29       0            349216    7.8958          NaN        S  
..     ...               ...       ...          ...      ...  
861      0             28134   11.5000          NaN        S  
862      0             17466   25.9292          D17        S  
863      2          CA. 2343   69.5500          NaN        S  
864      0            233866   13.0000          NaN        S  
865      0            236852   13.0000          NaN        S  
866      0     SC/PARIS 2149   13.8583          NaN        C  
867      0          PC 17590   50.4958          A24        S  
868      0            345777    9.5000          NaN        S  
869      1            347742   11.1333          NaN        S  
870      0            349248    7.8958          NaN        S  
871      1             11751   52.5542          D35        S  
872      0               695    5.0000  B51 B53 B55        S  
873      0            345765    9.0000          NaN        S  
874      0         P/PP 3381   24.0000          NaN        C  
875      0              2667    7.2250          NaN        C  
876      0              7534    9.8458          NaN        S  
877      0            349212    7.8958          NaN        S  
878      0            349217    7.8958          NaN        S  
879      1             11767   83.1583          C50        C  
880      1            230433   26.0000          NaN        S  
881      0            349257    7.8958          NaN        S  
882      0              7552   10.5167          NaN        S  
883      0  C.A./SOTON 34068   10.5000          NaN        S  
884      0   SOTON/OQ 392076    7.0500          NaN        S  
885      5            382652   29.1250          NaN        Q  
886      0            211536   13.0000          NaN        S  
887      0            112053   30.0000          B42        S  
888      2        W./C. 6607   23.4500          NaN        S  
889      0            111369   30.0000         C148        C  
890      0            370376    7.7500          NaN        Q  

[891 rows x 12 columns]

That wasn't so helpful. So let's look at just the first few rows:

In[22]: df.head(3)

Out[21]: 
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S

Notice that it has column names, and that the row index is labelled down the side. (Note: you can also try df.tail(3), and you can pass either method any number of rows.) Now, compared to the original data array, what kind of object is this?

In[23]: type(df)
Out[22]: pandas.core.frame.DataFrame

Recall that when we used the csv package before, every value was interpreted as a string. How does Pandas interpret the values using its own csv reader?

In[25]: df.dtypes
Out[24]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Pandas infers numerical types whenever it can detect them, so we have values already stored as integers. Where it detected decimal points somewhere in Age and Fare, it converted those columns to float. There are two more very valuable commands to learn on a dataframe:

In[26]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

There's a lot of useful info there! You can see immediately that we have 891 entries (rows), and that most of the variables have complete values (891 are non-null). But not Age, Cabin, or Embarked: those have nulls somewhere. Now try:
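The null counts reported above also explain why pandas handles the Age column where the raw Numpy conversion failed. As a minimal sketch, assuming Python 3 and a small made-up sample rather than the real train.csv (the name sample and its four rows are hypothetical), the following shows read_csv turning empty fields into NaN, counting them per column, averaging Age while skipping them, and summarizing the numeric columns with describe(). The describe() call is an assumption about a useful next command; it is not taken from the session above.

```python
import io
import pandas as pd

# Hypothetical four-row sample shaped like train.csv (not the real data).
sample_csv = io.StringIO(
    "PassengerId,Age,Fare,Cabin\n"
    "1,22,7.25,\n"
    "2,38,71.2833,C85\n"
    "3,,7.925,\n"
    "4,35,53.1,C123\n"
)
sample = pd.read_csv(sample_csv, header=0)

# Empty fields are read as NaN, so Age becomes float64 even though
# every age that is present is a whole number.
print(sample.dtypes)

# Count missing values per column (the complement of info()'s non-null counts).
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# mean() skips NaN by default, so no manual filtering is needed.
mean_age = sample['Age'].mean()
print(mean_age)

# describe() summarizes the numeric columns (count, mean, std, quartiles).
summary = sample.describe()
print(summary)
```

Note that the count row of describe() reflects only the non-null values, so it can differ from column to column.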

Data Munging

One step in any data analysis is data cleaning. Thankfully, pandas makes it easier to filter, manipulate, drop, fill in, transform, and replace values inside the dataframe. Below we also learn the syntax pandas uses for referring to specific columns.

Referencing and filtering

Let's acquire the first 10 rows of the Age column. In pandas this is
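A minimal sketch of column referencing and filtering, offered as one reasonable way to write it rather than the tutorial's own command. The frame df_sample is a hypothetical stand-in for df whose Age values copy the first twelve ages printed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: the first twelve ages, with one missing.
df_sample = pd.DataFrame(
    {'Age': [22, 38, 26, 35, 35, np.nan, 54, 2, 27, 14, 4, 58]}
)

# Referencing: pick a column by name, then take its first 10 rows.
first_ten = df_sample['Age'].iloc[0:10]
same_rows = df_sample['Age'].head(10)   # equivalent shortcut
print(first_ten)

# Filtering: a boolean condition on a column selects the matching rows.
# Rows where Age is NaN compare as False and are left out.
older_than_50 = df_sample[df_sample['Age'] > 50]
print(older_than_50)
```

Selecting by column name is what removes the need to remember that Age was column index 5 in the Numpy array.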
