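These are my notes from studying Python for Data Analysis, plus the material I will present at next Tuesday's internal sharing session, shared here for everyone. I can be careless, so corrections are welcome. Many parts are still incomplete, and I will fill them in as I keep learning.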
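Foreword: an introduction to the Python libraries used for data analysis (items 1 to 4 are excerpted from Python for Data Analysis)

1. NumPy
NumPy is the foundational package for scientific computing in Python. It provides, among other things:
(1) a fast and efficient multidimensional array object, ndarray
(2) functions for element-wise computations on arrays and for mathematical operations between arrays
(3) tools for reading and writing array-based data sets on disk
(4) linear algebra operations, Fourier transforms, and random number generation
(5) tools for integrating C, C++, and Fortran code with Python

2. pandas
pandas provides a large set of data structures and functions for working with structured data quickly and conveniently. It combines NumPy's high-performance array computing with the flexible data manipulation of spreadsheets and relational databases (such as SQL). Its sophisticated indexing makes it easy to reshape, slice and dice, aggregate, and select subsets of data.
For users in the financial industry, pandas provides many high-performance time series features and tools suited to financial data.
DataFrame is a pandas object: a column-oriented, two-dimensional table structure with row labels and column labels.
ps. A passage quoted from the web on what makes DataFrame powerful: Excel 2007 and later versions allow at most 1,048,576 rows and 16,384 columns; for data beyond that size Excel pops up a dialog saying the data cannot be placed in a single worksheet. pandas handles tens of millions of rows with ease, and as we will see later it is also more expressive than SQL, can perform many complex operations, and needs less code. That is a long list of benefits, but to really feel them you have to write some code.

3. matplotlib
matplotlib is the most popular Python library for plotting data charts.

4. SciPy
SciPy is a collection of packages that address standard problem domains in scientific computing.

5. statsmodels: https://github.com/statsmodels/statsmodels
6. scikit-learn: http://scikit-learn.org/stable/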
I. Data import and export
(1) Reading CSV files
1. Reading from a local file

import pandas as pd
df = pd.read_csv('E:\\tips.csv')  # adjust to where your data file is saved (p.s. in Python paths, use either / or \\)
# Output:
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
5         25.29  4.71    Male     No   Sun  Dinner     4
..          ...   ...     ...    ...   ...     ...   ...
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

[244 rows x 7 columns]
2. Reading over the network

import pandas as pd
data_url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"  # read from a URL
df = pd.read_csv(data_url)
# Output is the same as above; omitted here to save space

3. read_csv in detail
Purpose: Read CSV (comma-separated) file into DataFrame
read_csv(filepath_or_buffer, sep=',', dialect=None, compression='infer', doublequote=True, escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None, header='infer', index_col=None, names=None, prefix=None, skiprows=None, skipfooter=None, skip_footer=0, na_values=None, true_values=None, false_values=None, delimiter=None, converters=None, dtype=None, usecols=None, engine=None, delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False, use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True, error_bad_lines=True, keep_default_na=True, thousands=None, comment=None, decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None, memory_map=False, float_precision=None, nrows=None, iterator=False, chunksize=None, verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False, infer_datetime_format=False, skip_blank_lines=True)
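A handful of the parameters in this signature cover most everyday cases. The sketch below is only an illustration: it parses a small hand-written sample shaped like tips.csv (with a made-up "missing" cell) from memory instead of a real file, and exercises sep, usecols, na_values and nrows.

```python
import io
import pandas as pd

# Hand-written sample standing in for tips.csv (assumption: same column names).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,missing,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
"""

sample = pd.read_csv(
    io.StringIO(csv_text),                 # a file path or URL works here as well
    sep=',',                               # field delimiter
    usecols=['total_bill', 'tip', 'sex'],  # load only these columns
    na_values=['missing'],                 # extra strings to treat as missing
    nrows=3,                               # read only the first 3 data rows
)
print(sample)
```

Reading only the columns and rows you need is often the simplest way to keep memory use down on large files.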
Parameter details: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

(2) Reading MySQL data
Suppose the database is installed locally, the user name is myusername, the password is mypassword, and we want to read data from the mydb database.

import pandas as pd
import MySQLdb
mysql_cn = MySQLdb.connect(host='localhost', port=3306, user='myusername', passwd='mypassword', db='mydb')
df = pd.read_sql('select * from test;', con=mysql_cn)
mysql_cn.close()

The code above reads every row of the test table into df, which is a DataFrame.
ps. MySQL tutorial: http://www.runoob.com/mysql/mysql-tutorial.html

(3) Reading Excel files
Reading Excel files also requires the xlrd module; pip install xlrd is enough.
df = pd.read_excel('E:\\tips.xls')
(4) Exporting data to a CSV file

df.to_csv('E:\\demo.csv', encoding='utf-8', index=False)  # index=False drops the row labels on export; if the data contains Chinese text, encoding is usually set to 'utf-8'
(5) Reading and writing SQL databases

import pandas as pd
import sqlite3
con = sqlite3.connect('...')
sql = '...'
df = pd.read_sql(sql, con)

# help text
help(sqlite3.connect)
# Output
Help on built-in function connect in module _sqlite3:

connect(...)
    connect(database[, timeout, isolation_level, detect_types, factory])

    Opens a connection to the SQLite database file *database*. You can use
    ":memory:" to open a database connection to a database that resides in
    RAM instead of on disk.

#############
help(pd.read_sql)
# Output
Help on function read_sql in module pandas.io.sql:

read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
    Read SQL query or database table into a DataFrame.

ps. I pasted the database code straight from the web and have not tested whether it works; I am posting it here for now.
I am still finding my way around databases, so feel free to share your own notes and lessons learned.
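Since the snippet above is untested, here is a minimal round trip that can be run without any database server. It is a sketch under assumptions: the in-memory database, the table name tips and the three sample rows are all made up for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database; nothing is written to disk.
con = sqlite3.connect(':memory:')

tips_sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'sex': ['Female', 'Male', 'Male'],
})

# Write the DataFrame to a table, then read part of it back with a query.
tips_sample.to_sql('tips', con, index=False)
males = pd.read_sql('SELECT * FROM tips WHERE sex = ?', con, params=('Male',))
con.close()

print(males)
```

Passing values through params rather than formatting them into the SQL string keeps the query safe from quoting problems.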
II. Extracting and filtering the data you need
(1) Extracting and viewing data (using tips.csv; source: https://github.com/mwaskom/seaborn-data)

print(df.head())  # print the first five rows
# Output
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

print(df.tail())  # print the last five rows
# Output
     total_bill   tip     sex smoker   day    time  size
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

print(df.columns)  # print the column names
# Output
Index([u'total_bill', u'tip', u'sex', u'smoker', u'day', u'time', u'size'], dtype='object')

print(df.index)  # print the row labels
# Output
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            234, 235, 236, 237, 238, 239, 240, 241, 242, 243],
           dtype='int64', length=244)

print(df.iloc[10:21, 0:3])  # print rows 10 to 20 and the first three columns (originally df.ix[10:20, 0:3]; .ix was removed in pandas 1.0)
# Output
    total_bill   tip     sex
10       10.27  1.71    Male
11       35.26  5.00  Female
12       15.42  1.57    Male
13       18.43  3.00    Male
14       14.83  3.02  Female
15       21.58  3.92    Male
16       10.33  1.67  Female
17       16.29  3.71    Male
18       16.97  3.50  Female
19       20.65  3.35    Male
20       17.92  4.08    Male
# Extract non-contiguous rows and columns; this example takes rows 1, 3, 5 and columns 2, 4 (counting from 0)
df.iloc[[1,3,5],[2,4]]
# Output
    sex  day
1  Male  Sun
3  Male  Sun
5  Male  Sun

# Extract a single value; this example takes row 3, column 2 (counting from 0)
df.iat[3,2]
# Output
'Male'

print(df.drop(df.columns[[1, 2]], axis=1))  # drop the columns at positions 1 and 2 (tip and sex)
print(df.drop(df.index[[1, 2]], axis=0))    # drop the rows at positions 1 and 2
# Results omitted to save space

print(df.shape)  # print the dimensions
# Output
(244, 7)

df.iloc[3]  # select row 3
# Output 1
total_bill     23.68
tip             3.31
sex             Male
smoker            No
day              Sun
time          Dinner
size               2
Name: 3, dtype: object

df.iloc[2:4]  # select rows 2 to 3
# Output 2
   total_bill   tip   sex smoker  day    time  size
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2

df.iloc[0,1]  # select the element at row 0, column 1
# Output 3
1.01
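The selections above mix two ideas that are easy to confuse: selecting by label and selecting by position. In tips.csv the row labels happen to equal the positions, which hides the difference, so the sketch below uses a small made-up frame whose labels start at 10.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from their positions.
prices = pd.DataFrame(
    {'total_bill': [16.99, 10.34, 21.01, 23.68],
     'tip': [1.01, 1.66, 3.50, 3.31]},
    index=[10, 11, 12, 13],
)

by_label = prices.loc[10:12, 'tip']   # labels 10..12, end label included
by_position = prices.iloc[0:2, 1]     # positions 0 and 1, end position excluded

print(by_label.tolist())
print(by_position.tolist())
```

loc slices include the end label while iloc slices exclude the end position, which is why a label range 10:20 corresponds to the position range 10:21 on the tips data.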
(2) Filtering the data you need (using tips.csv; source: https://github.com/mwaskom/seaborn-data)

# example: suppose we want the rows where the tip is greater than $8
df[df.tip>8]
# Output
     total_bill  tip   sex smoker  day    time  size
170       50.81   10  Male    Yes  Sat  Dinner     3
212       48.33    9  Male     No  Sat  Dinner     4

# Conditions can also be combined with "or" and "and", for example:
#1
df[(df.tip>7)|(df.total_bill>50)]  # rows where the tip is greater than $7 or the total bill is greater than $50
# Output
     total_bill    tip   sex smoker  day    time  size
23        39.42   7.58  Male     No  Sat  Dinner     4
170       50.81  10.00  Male    Yes  Sat  Dinner     3
212       48.33   9.00  Male     No  Sat  Dinner     4
#2
df[(df.tip>7)&(df.total_bill>50)]  # rows where the tip is greater than $7 and the total bill is greater than $50
# Output
     total_bill  tip   sex smoker  day    time  size
170       50.81   10  Male    Yes  Sat  Dinner     3

# Continuing from above
# Suppose that after filtering we only care about day and time
df[['day','time']][(df.tip>7)|(df.total_bill>50)]
# Output
     day    time
23   Sat  Dinner
170  Sat  Dinner
212  Sat  Dinner
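Row filtering and column selection can also be done in a single .loc call, which avoids indexing twice in a row. This is a minimal sketch on a few made-up rows shaped like tips.csv, not the real file.

```python
import pandas as pd

# Hypothetical rows shaped like tips.csv, for illustration only.
bills = pd.DataFrame({
    'total_bill': [39.42, 50.81, 48.33, 16.99],
    'tip': [7.58, 10.00, 9.00, 1.01],
    'day': ['Sat', 'Sat', 'Sat', 'Sun'],
    'time': ['Dinner'] * 4,
})

mask = (bills['tip'] > 8) | (bills['total_bill'] > 50)   # element-wise "or"
selected = bills.loc[mask, ['day', 'time']]              # rows by mask, columns by name

print(selected)
```

Each condition needs its own parentheses because | and & bind more tightly than the comparison operators.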
III. Descriptive statistics (using tips.csv; source: https://github.com/mwaskom/seaborn-data)

print(df.describe())  # descriptive statistics
# Output (the statistics are straightforward, so they are not explained here)
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000
IV. Data processing

(1) Transposing data (using tips.csv; source: https://github.com/mwaskom/seaborn-data)

print(df.T)
# Output
                 0       1       2       3       4       5       6       7
total_bill   16.99   10.34   21.01   23.68   24.59   25.29    8.77   26.88
tip           1.01    1.66     3.5    3.31    3.61    4.71       2    3.12
sex         Female    Male    Male    Male  Female    Male    Male    Male
smoker          No      No      No      No      No      No      No      No
day            Sun     Sun     Sun     Sun     Sun     Sun     Sun     Sun
time        Dinner  Dinner  Dinner  Dinner  Dinner  Dinner  Dinner  Dinner
size             2       3       3       2       4       4       2       4

                 8       9   ...     234     235     236     237     238
total_bill   15.04   14.78   ...   15.53   10.07    12.6   32.83   35.83
tip           1.96    3.23   ...       3    1.25       1    1.17    4.67
sex           Male    Male   ...    Male    Male    Male    Male  Female
smoker          No      No   ...     Yes      No     Yes     Yes      No
day            Sun     Sun   ...     Sat     Sat     Sat     Sat     Sat
time        Dinner  Dinner   ...  Dinner  Dinner  Dinner  Dinner  Dinner
size             2       2   ...       2       2       2       2       3

               239     240     241     242     243
total_bill   29.03   27.18   22.67   17.82   18.78
tip           5.92       2       2    1.75       3
sex           Male  Female    Male    Male  Female
smoker          No     Yes     Yes      No      No
day            Sat     Sat     Sat     Sat    Thur
time        Dinner  Dinner  Dinner  Dinner  Dinner
size             3       2       2       2       2

[7 rows x 244 columns]
(2) Sorting data (using tips.csv; source: https://github.com/mwaskom/seaborn-data)

df.sort_values(by='tip')  # sort by the tip column in ascending order
# Output (shortened to save space)
     total_bill    tip     sex smoker   day    time  size
67         3.07   1.00  Female    Yes   Sat  Dinner     1
236       12.60   1.00    Male    Yes   Sat  Dinner     2
92         5.75   1.00  Female    Yes   Fri  Dinner     2
111        7.25   1.00  Female     No   Sat  Dinner     1
0         16.99   1.01  Female     No   Sun  Dinner     2
..          ...    ...     ...    ...   ...     ...   ...
214       28.17   6.50  Female    Yes   Sat  Dinner     3
141       34.30   6.70    Male     No  Thur   Lunch     6
59        48.27   6.73    Male     No   Sat  Dinner     4
23        39.42   7.58    Male     No   Sat  Dinner     4
212       48.33   9.00    Male     No   Sat  Dinner     4
170       50.81  10.00    Male    Yes   Sat  Dinner     3

[244 rows x 7 columns]
(3) Handling missing values

1. Filling missing values (data from Python for Data Analysis, chapter 2: usagov_bitly_data2012-03-16-1331923249.txt; ask me if you need it)

import json  # Python has many built-in and third-party modules that convert a JSON string into a Python dict
import pandas as pd
import numpy as np
from pandas import DataFrame
path = r'F:\PycharmProjects\pydata-book-master\ch02\usagov_bitly_data2012-03-16-1331923249.txt'  # adjust to your own path; the raw string keeps the backslashes from being read as escapes
records = [json.loads(line) for line in open(path)]
frame = DataFrame(records)
frame['tz']
# Output (part of the output removed to save space)
0        America/New_York
1          America/Denver
2        America/New_York
3       America/Sao_Paulo
4        America/New_York
5        America/New_York
6           Europe/Warsaw
7
8
9
10    America/Los_Angeles
11       America/New_York
12       America/New_York
13                    NaN
...
Name: tz, dtype: object

The output above shows that the data contains unknown or missing values. Next we handle the missing ones.

print(frame['tz'].fillna(1111111111111))  # replace missing values with a number
# Output (part of the output removed to save space)
0        America/New_York
1          America/Denver
2        America/New_York
3       America/Sao_Paulo
4        America/New_York
5        America/New_York
6           Europe/Warsaw
7
8
9
10    America/Los_Angeles
11       America/New_York
12       America/New_York
13          1111111111111
Name: tz, dtype: object

print(frame['tz'].fillna('YuJie2333333333333'))  # replace missing values with a string
# Output (part of the output removed to save space)
0        America/New_York
1          America/Denver
2        America/New_York
3       America/Sao_Paulo
4        America/New_York
5        America/New_York
6           Europe/Warsaw
7
8
9
10    America/Los_Angeles
11       America/New_York
12       America/New_York
13     YuJie2333333333333
Name: tz, dtype: object
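There is also:

print(frame['tz'].ffill())  # fill each missing value with the previous value (same result as fillna(method='pad'))
print(frame['tz'].bfill())  # fill each missing value with the next value (same result as fillna(method='bfill'))

2. Dropping missing values (same data as above)

print(frame['tz'].dropna())  # drop the missing entries of the tz column (a Series has only one axis)
print(frame.dropna(axis=0))  # drop the rows of the DataFrame that contain missing values
print(frame.dropna(axis=1))  # drop the columns of the DataFrame that contain missing values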
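One detail in the tz output is easy to miss: rows 7 to 9 are empty strings, not NaN, so fillna and dropna leave them untouched. The sketch below uses a short hand-made Series (an assumption standing in for frame['tz'], not the real file) to show one way of treating both kinds of gap.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for frame['tz']: one empty string and one real NaN.
tz = pd.Series(['America/New_York', '', 'Europe/Warsaw', np.nan])

filled_only = tz.fillna('Missing')     # touches the NaN, not the ''
cleaned = tz.replace('', np.nan)       # turn '' into NaN first
unknown = cleaned.fillna('Unknown')    # now both gaps get a label
dropped = cleaned.dropna()             # or drop them altogether

print(filled_only.tolist())
print(unknown.tolist())
print(dropped.tolist())
```

Whether an empty string should count as missing depends on the data; here it is treated as missing purely for illustration.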
3. Filling missing values by interpolation

There is no data at hand, so here is a small aside: creating a random data frame.

import pandas as pd
import numpy as np
# create a 6x4 data frame; randn generates random numbers
czf_data = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
czf_data
# Output
          A         B         C         D
0  0.355690  1.165004  0.810392 -0.818982
1  0.496757 -0.490954 -0.407960 -0.493502
2 -0.202123 -0.842278 -0.948464  0.223771
3  0.969445  1.357910 -0.479598 -1.199428
4  0.125290  0.943056 -0.082404 -0.363640
5 -1.762905 -1.471447  0.351570 -1.546152

There, the data is ready. Next we overwrite some values with NaN to create a DataFrame that contains missing values.

# set the row labeled 2 to missing values
czf_data.loc[2,:] = np.nan
czf_data
# Output
          A         B         C         D
0  0.355690  1.165004  0.810392 -0.818982
1  0.496757 -0.490954 -0.407960 -0.493502
2       NaN       NaN       NaN       NaN
3  0.969445  1.357910 -0.479598 -1.199428
4  0.125290  0.943056 -0.082404 -0.363640
5 -1.762905 -1.471447  0.351570 -1.546152

# Now the gaps can be filled by interpolation
print(czf_data.interpolate())
# Output
          A         B         C         D
0  0.355690  1.165004  0.810392 -0.818982
1  0.496757 -0.490954 -0.407960 -0.493502
2  0.733101  0.433478 -0.443779 -0.846465
3  0.969445  1.357910 -0.479598 -1.199428
4  0.125290  0.943056 -0.082404 -0.363640
5 -1.762905 -1.471447  0.351570 -1.546152
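By default interpolate() works linearly down each column, so a single missing value becomes the midpoint of its two neighbors. That matches the filled row above: for column A, (0.496757 + 0.969445) / 2 = 0.733101. The sketch below repeats the idea on a small deterministic frame (made up for the example) so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Small deterministic frame with one missing row.
demo = pd.DataFrame({'A': [1.0, np.nan, 3.0, 7.0],
                     'B': [10.0, np.nan, 20.0, 40.0]})

filled = demo.interpolate()   # default method='linear', applied column by column

print(filled)
```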
(4) Grouping data (using tips.csv; source: https://github.com/mwaskom/seaborn-data)

group = df.groupby('day')  # group by the day column
#1
print(group.first())  # print the first row of each group
# Output
      total_bill   tip     sex smoker    time  size
day
Fri        28.97  3.00    Male    Yes  Dinner     2
Sat        20.65  3.35    Male     No  Dinner     3
Sun        16.99  1.01  Female     No  Dinner     2
Thur       27.20  4.00    Male     No   Lunch     4
#2
print(group.last())  # print the last row of each group
# Output
      total_bill   tip     sex smoker    time  size
day
Fri        10.09  2.00  Female    Yes   Lunch     2
Sat        17.82  1.75    Male     No  Dinner     2
Sun        15.69  1.50    Male    Yes  Dinner     2
Thur       18.78  3.00  Female     No  Dinner     2
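Besides first() and last(), a grouped object is most often used to compute summary statistics per group. The following is a minimal sketch on a few made-up rows shaped like tips.csv; the values are only there to make the result checkable.

```python
import pandas as pd

# Hypothetical rows shaped like tips.csv.
meals = pd.DataFrame({
    'day': ['Sun', 'Sun', 'Sat', 'Sat', 'Sat'],
    'total_bill': [16.99, 10.34, 20.65, 17.92, 20.29],
    'tip': [1.01, 1.66, 3.35, 4.08, 2.75],
})

# Several aggregations of the tip column for each day.
summary = meals.groupby('day')['tip'].agg(['count', 'mean', 'max'])

print(summary)
```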
(5) Replacing values

import pandas as pd
import numpy as np
# first create a Series (a lifesaver when you have no data)
Series = pd.Series([0,1,2,3,4,5])
Series
# Output
0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

# replace a value, for example 0 with 10000000000000
print(Series.replace(0,10000000000000))
# Output
0    10000000000000
1                 1
2                 2
3                 3
4                 4
5                 5
dtype: int64

# replacing a list of values with another list works the same way
print(Series.replace([0,1,2,3,4,5],[11111,222222,3333333,44444,55555,666666]))
# Output
0      11111
1     222222
2    3333333
3      44444
4      55555
5     666666
dtype: int64
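V. Statistical analysis

(1) t-tests

1. Independent-samples t-test

The two-independent-samples t-test uses sample data to infer whether the means of the two independent populations the samples come from differ significantly. It requires that the two populations are independent of each other and normally distributed.

At first I could not find suitable data, so I copied the sample data from an SPSS independent-samples t-test example I found online. Please make do with it for now; I will swap in a better example when I find one.

I saved the data locally in xlsx format, with a group column and a data column:

import pandas as pd
from scipy.stats import ttest_ind
IS_t_test = pd.read_excel('E:\\IS_t_test.xlsx')
Group1 = IS_t_test[IS_t_test['group']==1]['data']
Group2 = IS_t_test[IS_t_test['group']==2]['data']
print(ttest_ind(Group1,Group2))
# Output
(-4.7515451390104353, 0.0014423819408438474)

The first element of the output is the t statistic and the second is the p-value.

By default ttest_ind assumes that the two groups have equal variances. To drop that assumption, set equal_var=False, which runs Welch's t-test.

print(ttest_ind(Group1,Group2,equal_var=True))
print(ttest_ind(Group1,Group2,equal_var=False))
# Output
(-4.7515451390104353, 0.0014423819408438474)
(-4.7515451390104353, 0.0014425608643614844)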
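In the output above the two t values are identical and only the p-values differ slightly, which is what one would expect when both groups have the same size. When the group sizes differ, the two variants generally give different t values as well. The sketch below shows this on made-up measurements (not the data from the Excel file).

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up measurements for two independent groups of different sizes.
group_a = np.array([12.1, 11.8, 12.5, 12.0, 11.9])
group_b = np.array([13.9, 15.2, 12.8, 16.1, 14.4, 13.0])

t_pooled, p_pooled = ttest_ind(group_a, group_b, equal_var=True)   # Student's t-test
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)    # Welch's t-test

print(t_pooled, p_pooled)
print(t_welch, p_welch)
```

If a variance test such as Levene's suggests unequal variances, the Welch variant is usually the safer choice.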
2. Paired-samples t-test

Again I could not find data, so let us pretend for now that the independent samples above are paired samples and reuse the same data.

import pandas as pd
from scipy.stats import ttest_rel
IS_t_test = pd.read_excel('E:\\IS_t_test.xlsx')
Group1 = IS_t_test[IS_t_test['group']==1]['data']
Group2 = IS_t_test[IS_t_test['group']==2]['data']
print(ttest_rel(Group1,Group2))
# Output
(-5.6873679190073361, 0.00471961872448184)

As before, the first element of the output is the t statistic and the second is the p-value.
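A paired t-test is the same thing as a one-sample t-test on the pairwise differences, tested against a mean of 0. The sketch below checks that equivalence on made-up before/after measurements; the numbers are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Made-up before/after measurements on the same five subjects.
before = np.array([72.0, 68.5, 80.1, 75.3, 69.8])
after = np.array([70.2, 67.9, 77.5, 74.8, 68.1])

t_paired, p_paired = ttest_rel(before, after)
t_diff, p_diff = ttest_1samp(before - after, 0.0)   # same test, written on the differences

print(t_paired, p_paired)
print(t_diff, p_diff)
```

This also explains why ttest_rel needs both samples to have the same length and a meaningful pairing between their elements.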
(2) Analysis of variance

1. One-way ANOVA

We keep using the t-test data here.

import pandas as pd
from scipy import stats
IS_t_test = pd.read_excel('E:\\IS_t_test.xlsx')
Group1 = IS_t_test[IS_t_test['group']==1]['data']
Group2 = IS_t_test[IS_t_test['group']==2]['data']
args = [Group1, Group2]  # the groups to compare
w,p = stats.levene(*args)  # Levene's test for equal variances: levene(*args, **kwds) Perform Levene test for equal variances. If p<0.05, the variances are unequal
print(w,p)
# run the ANOVA
f,p = stats.f_oneway(*args)
print(f,p)
# Output
(0.019607843137254936, 0.89209916055865535)
22.5771812081 0.00144238194084
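With only two groups, one-way ANOVA is equivalent to the equal-variance t-test: F equals t squared and the p-values coincide. The outputs above agree with this, since (-4.7515)² is about 22.577 and both p-values are 0.0014424. The sketch below checks the relationship on made-up data.

```python
import numpy as np
from scipy import stats

# Made-up data for two groups of equal size.
g1 = np.array([4.1, 5.0, 4.6, 5.3, 4.8])
g2 = np.array([5.9, 6.4, 5.7, 6.8, 6.1])

t_stat, p_t = stats.ttest_ind(g1, g2)   # pooled-variance t-test
f_stat, p_f = stats.f_oneway(g1, g2)    # one-way ANOVA on the same groups

print(t_stat ** 2, f_stat)   # the two statistics agree
print(p_t, p_f)              # and so do the p-values
```

ANOVA only becomes necessary once there are three or more groups to compare.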
2. Multi-factor ANOVA

The data is a multi-factor ANOVA example I found online; it studies the effect of block and nutrient on body weight. I turned it into an Excel file, so ask me if you need it. Multi-factor ANOVA needs the statsmodels module; if it is not installed, pip install it.

# import the data
import pandas as pd
MANOVA = pd.read_excel('E:\\MANOVA.xlsx')
MANOVA
# Output (the middle rows are removed to save space)

    id  nutrient  weight
0    1         1    50.1
1    2         1    47.8
2    3         1    53.1
3    4         1    63.5
4    5         1    71.2
5    6         1    41.4
..  ..       ...     ...
21   6         3    38.5
22   7         3    51.2
23   8         3    46.2

# multi-factor ANOVA
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
formula = 'weight~C(id)+C(nutrient)+C(id):C(nutrient)'
anova_results = anova_lm(ols(formula,MANOVA).fit())
print(anova_results)
# Output
                   df        sum_sq     mean_sq    F  PR(>F)
C(id)               7  2.373613e+03  339.087619    0     NaN
C(nutrient)         2  1.456133e+02   72.806667    0     NaN
C(id):C(nutrient)  14  3.391667e+02   24.226190    0     NaN
Residual            0  8.077936e-27         inf  NaN     NaN

Perhaps the data was a poor choice: every p-value is NaN, and the table shows that the residual has 0 degrees of freedom, which leaves nothing to test against. I will redo the multi-factor ANOVA once I find better data.
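One possible reading of the NaN values, offered as an assumption rather than a diagnosis: the data appears to hold a single weight for each combination of id and nutrient (8 x 3 = 24 rows), and the formula includes the id:nutrient interaction, so the model uses up every degree of freedom. The arithmetic below reproduces the df column of the table and suggests that a formula without the interaction term, for example 'weight~C(id)+C(nutrient)', would leave 14 residual degrees of freedom.

```python
# Degrees-of-freedom bookkeeping for a two-factor layout.
# Assumption: one observation per (id, nutrient) cell, as the 24-row table suggests.
n_id, n_nutrient, obs_per_cell = 8, 3, 1
n_obs = n_id * n_nutrient * obs_per_cell

df_id = n_id - 1                       # 7
df_nutrient = n_nutrient - 1           # 2
df_interaction = df_id * df_nutrient   # 14

# Residual df = observations - 1 (intercept) - df of every term in the model.
resid_with_interaction = n_obs - 1 - df_id - df_nutrient - df_interaction
resid_additive = n_obs - 1 - df_id - df_nutrient

print(resid_with_interaction)   # 0: no error term, so F and p cannot be computed
print(resid_additive)           # 14: the main effects can be tested
```

Testing the interaction itself would need more than one observation per cell.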
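3. Repeated-measures ANOVA (one factor) ******** to be completed

A repeated-measures design measures the same dependent variable several times. The repeated measurements can be taken under the same condition or under different conditions.

The code is the same as for multi-factor ANOVA; only the reasoning differs. But I have not yet found suitable data for multi-factor ANOVA, so I will not write this part up for now.

4. Mixed-design ANOVA ******** to be completed

######### If you are good at statistics, please teach me.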
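(3) Chi-square test

The chi-square test measures how far the observed values of a sample deviate from the theoretically expected values. That deviation determines the size of the chi-square statistic: the larger the statistic, the worse the fit; the smaller the statistic, the smaller the deviation and the better the fit. If the two sets of values are exactly equal, the statistic is 0, meaning the observations match the theory perfectly. (from Baidu Baike)

1. One-factor chi-square test

The data comes from the web: the expected and observed numbers of men and women who wear makeup.

import numpy as np
from scipy import stats
from scipy.stats import chisquare
observed = np.array([15,95])  # observed: of the 110 students who wear makeup, 15 are men and 95 are women
expected = np.array([55,55])  # expected: of the 110 students who wear makeup, 55 are men and 55 are women
chisquare(observed,expected)
# Output
(58.18181818181818, 2.389775628860044e-14)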
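The statistic reported by chisquare is the sum over categories of (observed - expected)² / expected. For the data above that is 40²/55 + 40²/55, about 58.18. The sketch below computes it by hand and compares it with the scipy result.

```python
import numpy as np
from scipy.stats import chisquare

observed = np.array([15, 95])
expected = np.array([55, 55])

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected.
chi2_manual = np.sum((observed - expected) ** 2 / expected)

chi2_scipy, p_value = chisquare(observed, expected)

print(chi2_manual)   # about 58.18
print(chi2_scipy, p_value)
```

chisquare expects the observed and expected counts to have the same total, which holds here (110 each).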
2. Multi-factor chi-square test ***** still studying this; I will fill in this part once I have learned it
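One candidate for this section, offered only as a starting point, is the chi-square test of independence on a contingency table, which scipy provides as chi2_contingency. The 2x2 table below is entirely hypothetical; it is not taken from any data set in this post.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: rows = sex, columns = wears makeup / does not.
table = np.array([[10, 20],
                  [20, 10]])

# correction=False turns off Yates' continuity correction for 2x2 tables.
chi2, p, dof, expected = chi2_contingency(table, correction=False)

print(chi2, p, dof)
print(expected)   # counts expected if rows and columns were independent
```

Here every expected count is 15, so the statistic is 4 x 5²/15, about 6.67, with 1 degree of freedom.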
(4) Counting (data: tips.csv)

# example: count the rows by sex
count = df['sex'].value_counts()
print(count)
# Output
Male      157
Female     87
Name: sex, dtype: int64

(5) Regression analysis ***** still to learn: data fitting, generalized linear regression, and so on
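VI. Visualization

You might wonder why we go to all this trouble for things Excel can already do. Excel is indeed general and convenient, but it may struggle with very large data sets. Learning to plot in Python does no harm, and honestly, some of the results look really good.

(1) Seaborn

I started learning data visualization with Seaborn, a Python visualization library built on matplotlib. Starting directly with matplotlib can be a struggle, since it has many parameters that are hard to understand, but with patience you will get there. Another key point: the plots seaborn draws look great.

# basic imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

# the tips data is really handy, so it serves as the example here
tips = sns.load_dataset('tips')  # loads the data over the network
tips

1. The lmplot function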
lmplot(x, y, data, hue=None, col=None, row=None, palette=None, col_wrap=None, size=5, aspect=1, markers='o', sharex=True, sharey=True, hue_order=None, col_order=None, row_order=None, legend=True, legend_out=True, x_estimator=None, x_bins=None, x_ci='ci', scatter=True, fit_reg=True, ci=95, n_boot=1000, units=None, order=1, logistic=False, lowess=False, robust=False, logx=False, x_partial=None, y_partial=None, truncate=False, x_jitter=None, y_jitter=None, scatter_kws=None, line_kws=None)
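Purpose: Plot data and regression model fits across a FacetGrid.

The examples below explain lmplot's parameters.

Example 1. Plot the regression relationship between total bill and tip

This uses lmplot(x, y, data, scatter_kws).

x, y and data are self-explanatory, so they are not explained here. The official description of scatter_kws and line_kws is:

{scatter,line}_kws : dictionaries
Additional keyword arguments to pass to plt.scatter and plt.plot.

scatter refers to the points and line to the fitted line. In other words, dictionaries are used to set the properties of the points and of the line. In the example the points are slate gray and the line is Indian red, so the points and the line are visually separated, which displays well. You can substitute whatever plot properties you like.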
sns.lmplot(x="total_bill", y="tip", data=tips, scatter_kws={"marker": ".", "color": "slategray"}, line_kws={"linewidth": 1, "color": "indianred"}).savefig('picture2')
Also: colors can be given as RGB codes. This site has a reference table you can use to pick your own colors:
http://www.114la.com/other/rgb.htm
marker also comes in many styles, listed below:
. Point marker
, Pixel marker
o Circle marker
v Triangle down marker
^ Triangle up marker
< Triangle left marker
> Triangle right marker
1 Tripod down marker
2 Tripod up marker
3 Tripod left marker
4 Tripod right marker
s Square marker
p Pentagon marker
* Star marker
h Hexagon marker
H Rotated hexagon
D Diamond marker
d Thin diamond marker
| Vertical line (vline symbol) marker
_ Horizontal line (hline symbol) marker
+ Plus marker
x Cross (x) marker
sns.lmplot(x="total_bill", y="tip", data=tips, scatter_kws={"marker": ".", "color": "#FF7F00"}, line_kws={"linewidth": 1, "color": "#BF3EFF"}).savefig('s1')
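ps. My attempt to change the marker property had no effect and I am not sure why; any help is welcome. One likely cause: lmplot sets the point marker through its own markers parameter, which takes precedence over a "marker" entry in scatter_kws, so passing markers="." to lmplot directly may work.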
Example 2. Relationship between party size (size) and tip

Official description:

x_estimator : callable that maps vector -> scalar, optional
Apply this function to each unique value of x and plot the resulting estimate. This is useful when x is a discrete variable. If x_ci is not None, this estimate will be bootstrapped and a confidence interval will be drawn.

Roughly speaking: the function is applied to the y values that share the same x level, and the result is plotted once per level.
plt.figure()
sns.lmplot(x='size', y='tip', data=tips, x_estimator=np.mean).savefig('picture3')
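To make "apply this function to each unique value of x" concrete, the sketch below computes the same kind of per-level estimate with pandas. It only mimics the point estimates that x_estimator=np.mean would plot; it is not seaborn's own code, the bootstrapped confidence interval is left out, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical party sizes and tips.
party = pd.DataFrame({'size': [2, 2, 3, 3, 3, 4],
                      'tip': [1.0, 3.0, 2.0, 3.5, 3.5, 5.0]})

# One estimate per unique x value: here the mean tip for each party size.
estimates = party.groupby('size')['tip'].apply(np.mean)

print(estimates)
```

Any callable that maps a vector to a single number, such as np.median, could be used in place of np.mean.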
{x,y}_jitter : floats, optional
Add uniform random noise of this size to either the x or y variables. The noise is added to a copy of the data after fitting the regression, and only influences the look of the scatterplot. This can be helpful when plotting variables that take discrete values.

jitter is an interesting parameter, especially when the data points overlap heavily. Adding a moderate amount of noise spreads the points apart, so that what was a handful of stacked point clusters becomes a series of dense, neighboring clouds. Some people also like to combine rug with jitter; that is a matter of taste. When the horizontal axis takes discrete levels, x_jitter shifts the points horizontally, but the amount of jitter should not be too large.
sns.lmplot(x='size', y='tip', data=tips, x_jitter=.15).savefig('picture4')

seaborn can also produce xkcd-style plots, which is quite fun

with plt.xkcd():
    sns.color_palette('husl', 8)
    sns.set_context('paper')
    sns.lmplot(x='total_bill', y='tip', data=tips, ci=65).savefig('picture1')

with plt.xkcd():
    sns.lmplot(x='total_bill', y='tip', data=tips, hue='day')
    plt.xlabel('hue = day')
    plt.savefig('picture5')

with plt.xkcd():
    sns.lmplot(x='total_bill', y='tip', data=tips, hue='smoker')
    plt.xlabel('hue = smoker')
    plt.savefig('picture6')

sns.set_style('dark')
sns.set_context('talk')
sns.lmplot(x='size', y='total_bill', data=tips, order=2)
plt.title('# poly order = 2')
plt.savefig('picture7')
plt.figure()
sns.lmplot(x='size', y='total_bill', data=tips, order=3)
plt.title('# poly order = 3')
plt.savefig('picture8')

sns.jointplot(x="total_bill", y="tip", data=tips).savefig('picture9')
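(2) matplotlib ******** to be completed

VII. Miscellaneous

(1) Calling R

To call R functions directly from Python, download and install the rpy2 module.
Detailed steps: http://www.geome.cn/posts/python-%E9%80%9A%E8%BF%87rpy2%E8%B0%83%E7%94%A8-r%E8%AF%AD%E8%A8%80/
Tested and working. One big prerequisite: R must be installed on the computer.

(2) ipython ******** to be completed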