Because all of the libraries mentioned here are open source, we have also noted each library's commit count, contributor count, and other metrics, which give a rough indication of each Python library's popularity.
1.NumPy
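(Commits: 15980; Contributors: 522)
When first working with Python, one inevitably turns to Python's SciPy Stack for help; the SciPy Stack is a collection of software designed specifically for scientific computing in Python. Any discussion of Python libraries therefore has to mention it. The SciPy Stack is very broad, however, comprising more than a dozen libraries, so our task is to pick out its most important packages.
NumPy (short for Numerical Python) is the most fundamental package on which the scientific computation stack is built. It is feature-rich and covers the operations needed for n-dimensional arrays and matrices in Python. The library provides vectorized mathematical operations on the NumPy array type, which improves performance and speeds up execution.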
2.SciPy
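(Commits: 17213; Contributors: 489)
SciPy is a library of software for engineering and science. You also need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built on NumPy, so its arrays make heavy use of NumPy. It provides efficient numerical routines, such as numerical integration and optimization, through its specific submodules. The functions in all of SciPy's submodules are well documented, which is another of its strengths.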
3.Pandas
(Commits: 15089; Contributors: 762)
Pandas is a Python package that makes working with “labeled” and “relational” data simple and intuitive. Pandas is a perfect tool for data wrangling. With it, users can quickly and easily perform data manipulation, aggregation, and visualization.
The Pandas library has two main data structures:
“Series”: one-dimensional
“Data Frames”: two-dimensional
For example, if you append a row to a DataFrame by passing a Series, you obtain a new DataFrame from these two data structures.
With Pandas you can do the following:
Easily add or delete DataFrame columns
Convert data structures to DataFrame objects
Handle missing data, represented as NaNs
Powerful grouping functionality
4.Matplotlib
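(Commits: 21754; Contributors: 588)
Matplotlib is another core package of the SciPy Stack and another Python library, one that makes it easy to produce simple yet powerful visualizations. This top-notch package lets Python (with some help from NumPy, SciPy, and Pandas) compete with scientific tools such as MATLAB or Mathematica.
However, the library is fairly low-level, which means you need to write more code to achieve advanced visualizations and will generally put in more effort than with higher-level tools, but overall it is worth it.
You can use it to produce all kinds of visualizations: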
Line plots;
Scatter plots;
Bar charts and histograms;
Pie charts;
Stem plots;
Contour plots;
Quiver plots;
Spectrograms
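You can also use Matplotlib to create labels, grids, legends, and many other formatting elements. Basically, everything is customizable.
The library is supported on many platforms and uses different graphical user interface (GUI) toolkits to render the resulting visualizations. Many IDEs and interactive environments (such as IPython) support Matplotlib's functionality.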
5.Seaborn
(Commits: 1699; Contributors: 71)
Seaborn focuses mainly on the visualization of statistical models, such as heat maps, which summarize the data while still depicting its overall distribution. Seaborn is based on Matplotlib and depends heavily on it.
6.Bokeh
(Commits: 15724; Contributors: 223)
Bokeh is another powerful visualization library, aimed at interactive visualizations. Unlike the previous library, it is independent of Matplotlib. Bokeh's main focus is interactivity, so it presents its output in modern browsers in the style of Data-Driven Documents (d3.js).
7.Plotly
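(Commits: 2486; Contributors: 33)
It is a web-based toolbox for building visualizations that exposes application programming interfaces (APIs) to several programming languages (Python among them). There are a number of robust, out-of-the-box graphics on the plot.ly website. Before using Plotly, you need to set up your API key. The graphics are processed on the server side and then published on the internet, although you can choose not to publish them.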
Original English text
Top 15 Python Libraries for Data Science in 2017
As Python has gained a lot of traction in the data science industry in recent years, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.
And, since all of the libraries are open source, we have added commit counts, contributor counts, and other metrics from GitHub, which can serve as proxy metrics for library popularity.
Core Libraries.
1. NumPy (Commits: 15980, Contributors: 522)
When you start working on scientific tasks in Python, you inevitably turn to Python’s SciPy Stack, a collection of software designed specifically for scientific computing in Python (not to be confused with the SciPy library, which is part of this stack, or with the community around the stack). We therefore start with a look at it. The stack is vast, however: it contains more than a dozen libraries, so we focus on the core packages, particularly the most essential ones.
The most fundamental package, around which the scientific computation stack is built, is NumPy (short for Numerical Python). It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which improves performance and accordingly speeds up execution.
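As a minimal sketch of what vectorization means in practice, the snippet below computes the same expression once with a Python loop and once with a single NumPy array expression. The sample data and variable names are illustrative assumptions, not part of the original article.

```python
import numpy as np

# Hypothetical sample data: 100,000 evenly spaced values in [0, 1].
values = np.linspace(0.0, 1.0, 100_000)

# Pure-Python loop: one interpreter-level operation per element.
squares_loop = [v * v + 1.0 for v in values]

# Vectorized form: the same arithmetic is applied to the whole array
# at once in compiled code, with no explicit Python loop.
squares_vectorized = values * values + 1.0

# Both approaches produce the same numbers.
print(np.allclose(squares_loop, squares_vectorized))
```

On typical hardware the vectorized expression runs much faster than the loop, although the exact speedup depends on array size and machine.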
2. SciPy (Commits: 17213, Contributors: 489)
SciPy is a library of software for engineering and science. Again, you need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built upon NumPy, and its arrays thus make substantial use of NumPy. It provides efficient numerical routines, such as numerical integration, optimization, and many others, via its specific submodules. The functions in all submodules of SciPy are well documented, which is another point in its favor.
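The following sketch touches three of the submodules mentioned above: `scipy.integrate`, `scipy.optimize`, and `scipy.linalg`. The functions being integrated, minimized, and solved are toy problems chosen for illustration only.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize (x - 3)^2, whose minimum lies at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2)

# Linear algebra: solve the 2x2 system A @ x = b using NumPy arrays.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

print(area, result.x, solution)
```

Note that every input and output here is a NumPy array or a plain float, which reflects how closely SciPy builds on NumPy.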
3. Pandas (Commits: 15089, Contributors: 762)
Pandas is a Python package designed to make working with “labeled” and “relational” data simple and intuitive. Pandas is a perfect tool for data wrangling. It is designed for quick and easy data manipulation, aggregation, and visualization.
There are two main data structures in the library:
“Series”: one-dimensional
“Data Frames”: two-dimensional
For example, you can obtain a new DataFrame from these two structures by appending a single row, passed as a Series, to an existing DataFrame.
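The row-append operation described above can be sketched as follows. The article dates from 2017, when `DataFrame.append` was available; that method was removed in pandas 2.0, so this sketch assumes a recent pandas version and uses `pd.concat` instead. The column names are illustrative, and the values reuse the contributor counts quoted in this article.

```python
import pandas as pd

# An existing two-dimensional DataFrame.
df = pd.DataFrame({
    "library": ["NumPy", "SciPy"],
    "contributors": [522, 489],
})

# A one-dimensional Series holding a single new row.
new_row = pd.Series({"library": "Pandas", "contributors": 762})

# Turn the Series into a one-row DataFrame and concatenate it.
# The original DataFrame is left unchanged; a new one is returned.
extended = pd.concat([df, new_row.to_frame().T], ignore_index=True)

print(extended)
```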
Here is just a small list of things that you can do with Pandas:
Easily add and delete DataFrame columns
Convert data structures to DataFrame objects
Handle missing data, represented as NaNs
Powerful group-by functionality
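A short sketch of the four operations listed above, using made-up data (the `team`, `member`, and `score` columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Convert a plain dict to a DataFrame; NaN marks a missing value.
df = pd.DataFrame({
    "team": ["red", "red", "blue", "blue"],
    "member": ["ann", "bob", "cy", "dee"],
    "score": [10.0, 12.0, 7.0, np.nan],
})

# Add a column, then delete another one.
df["has_score"] = df["score"].notna()
df = df.drop(columns=["member"])

# Handle missing data: here, replace NaN in "score" with 0.
filled = df.fillna({"score": 0.0})

# Group by team and aggregate the scores.
totals = filled.groupby("team")["score"].sum()

print(totals)
```

Filling with zero is only one way to treat missing values; dropping the affected rows with `dropna` is an equally common choice.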
4. Matplotlib (Commits: 21754, Contributors: 588)
Matplotlib is another SciPy Stack core package and another Python library, tailored for generating simple yet powerful visualizations with ease. It is a top-notch piece of software that makes Python (with some help from NumPy, SciPy, and Pandas) a serious competitor to scientific tools such as MATLAB or Mathematica.
However, the library is fairly low-level: you will need to write more code to reach advanced levels of visualization, and you will generally put in more effort than with higher-level tools, but the overall effort is worth it.
With a bit of effort you can make just about any visualization:
Line plots;
Scatter plots;
Bar charts and Histograms;
Pie charts;
Stem plots;
Contour plots;
Quiver plots;
Spectrograms
There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.
The library is supported on different platforms and makes use of different GUI toolkits to render the resulting visualizations. Various IDEs and interactive environments (such as IPython) support Matplotlib's functionality.
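To give a feel for the level of detail Matplotlib expects, here is a minimal sketch that draws a line plot and a scatter plot with labels, a grid, and legends. The data and the output file name `matplotlib_example.png` are assumptions made for illustration; the non-interactive `Agg` backend is selected so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 50)

# One figure with two side-by-side axes.
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(9, 4))

# Line plot with axis labels, a grid, and a legend.
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("value")
ax_line.set_title("Line plot")
ax_line.grid(True)
ax_line.legend()

# Scatter plot of a second function.
ax_scatter.scatter(x, np.cos(x), s=12, label="cos(x)")
ax_scatter.set_xlabel("x")
ax_scatter.set_title("Scatter plot")
ax_scatter.legend()

fig.tight_layout()
fig.savefig("matplotlib_example.png")  # hypothetical output path
```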
There are also some additional libraries that can make visualization even easier.
5. Seaborn (Commits: 1699, Contributors: 71)
Seaborn is mostly focused on the visualization of statistical models; such visualizations include heat maps, which summarize the data while still depicting the overall distributions. Seaborn is based on Matplotlib and depends heavily on it.
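As an illustration of the heat maps mentioned above, the sketch below draws a correlation heat map with `seaborn.heatmap`. The random data, column names, and output file name are assumptions; note that the plot is drawn onto an ordinary Matplotlib axes object, which shows the dependency on Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 100 random observations of four variables.
rng = np.random.default_rng(seed=0)
data = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

# Summarize the data as a correlation matrix.
corr = data.corr()

# Draw the matrix as an annotated heat map on a Matplotlib axes.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, cmap="coolwarm", ax=ax)
ax.set_title("Correlation heat map")

fig.savefig("seaborn_heatmap.png")  # hypothetical output path
```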
6. Bokeh (Commits: 15724, Contributors: 223)
Another great visualization library is Bokeh, which is aimed at interactive visualizations. In contrast to the previous library, this one is independent of Matplotlib. The main focus of Bokeh, as we already mentioned, is interactivity, and it presents its output in modern browsers in the style of Data-Driven Documents (d3.js).
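A minimal sketch of an interactive Bokeh plot, assuming a reasonably recent Bokeh release. The data and the output file name `bokeh_example.html` are illustrative; the result is a standalone HTML page that can be panned, zoomed, and hovered over in a browser.

```python
from bokeh.plotting import figure, output_file, save

# Hypothetical data points.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# Write the result to a standalone HTML file (hypothetical path).
output_file("bokeh_example.html", title="Bokeh example")

# The tools string enables the interactive controls shown in the browser.
plot = figure(
    title="y = x^2",
    x_axis_label="x",
    y_axis_label="y",
    tools="pan,wheel_zoom,box_zoom,reset,hover",
)
plot.line(x, y, legend_label="y = x^2", line_width=2)
plot.scatter(x, y, size=8)

save(plot)
```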
7. Plotly (Commits: 2486, Contributors: 33)
Finally, a word about Plotly. It is a web-based toolbox for building visualizations that exposes APIs to several programming languages (Python among them). There are a number of robust, out-of-the-box graphics on the plot.ly website. To use Plotly, you will need to set up your API key. The graphics are processed server side and posted on the internet, but there is a way to avoid this.