Visualization | Python Histogram Plotting (NumPy, Matplotlib, Pandas, and Seaborn)

In this tutorial, you’ll be equipped to make production-quality, presentation-ready Python histogram plots with a range of choices and features.
If you have introductory to intermediate knowledge in Python and statistics, you can use this article as a one-stop shop for building and plotting histograms in Python using libraries from its scientific stack, including NumPy, Matplotlib, Pandas, and Seaborn.
A histogram is a great tool for quickly assessing a probability distribution that is intuitively understood by almost any audience. Python offers a handful of different options for building and plotting histograms. Most people know a histogram by its graphical representation, which is similar to a bar graph:

[Figure: example histogram]

This article will guide you through creating plots like the one above as well as more complex ones. Here’s what you’ll cover:

  • Building histograms in pure Python, without use of third party libraries
  • Constructing histograms with NumPy to summarize the underlying data
  • Plotting the resulting histogram with Matplotlib, Pandas, and Seaborn
Free Bonus: Short on time? Click here to get access to a free two-page Python histograms cheat sheet that summarizes the techniques explained in this tutorial.
Histograms in Pure Python

When you are preparing to plot a histogram, it is simplest to not think in terms of bins but rather to report how many times each value appears (a frequency table). A Python dictionary is well-suited for this task:
>>> # Need not be sorted, necessarily
>>> a = (0, 1, 1, 1, 2, 3, 7, 7, 23)

>>> def count_elements(seq) -> dict:
...     """Tally elements from `seq`."""
...     hist = {}
...     for i in seq:
...         hist[i] = hist.get(i, 0) + 1
...     return hist
...
>>> counted = count_elements(a)
>>> counted
{0: 1, 1: 3, 2: 1, 3: 1, 7: 2, 23: 1}

count_elements() returns a dictionary with unique elements from the sequence as keys and their frequencies (counts) as values. Within the loop over seq, hist[i] = hist.get(i, 0) + 1 says, “for each element of the sequence, increment its corresponding value in hist by 1.”
In fact, this is precisely what is done by the collections.Counter class from Python’s standard library, which subclasses a Python dictionary and overrides its .update() method:
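As a minimal sketch of that equivalence, assuming the tuple a and the count_elements() function defined earlier, Counter can be applied directly to the sequence. The name recounted is the one used in the equality check that follows:

```python
from collections import Counter

a = (0, 1, 1, 1, 2, 3, 7, 7, 23)


def count_elements(seq) -> dict:
    """Tally elements from `seq` (same logic as the function above)."""
    hist = {}
    for i in seq:
        hist[i] = hist.get(i, 0) + 1
    return hist


counted = count_elements(a)

# Counter accepts any iterable of hashable items and tallies them.
recounted = Counter(a)
print(recounted)
```

Because Counter is a dict subclass, it compares equal to a plain dictionary holding the same keys and counts.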
You can confirm that your handmade function does virtually the same thing as collections.Counter by testing for equality between the two:
>>> recounted.items() == counted.items()
True

Technical Detail: collections.Counter builds the same mapping as count_elements() above, but its counting helper defaults to a more highly optimized C function if it is available. Within the Python function count_elements(), one micro-optimization you could make is to declare get = hist.get before the for-loop. This would bind a method to a variable for faster calls within the loop.
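A minimal sketch of that micro-optimization follows. The name count_elements_fast is hypothetical, chosen here only to distinguish it from the original function, and any speedup is likely to be small:

```python
def count_elements_fast(seq) -> dict:
    """Tally elements from `seq`, binding `hist.get` once up front."""
    hist = {}
    get = hist.get  # Bound method looked up once instead of on every pass
    for i in seq:
        hist[i] = get(i, 0) + 1
    return hist


a = (0, 1, 1, 1, 2, 3, 7, 7, 23)
print(count_elements_fast(a))
```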
It can be helpful to build simplified functions from scratch as a first step to understanding more complex ones. Let’s further reinvent the wheel a bit with an ASCII histogram that takes advantage of Python’s output formatting:
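One way such a function could look is sketched here. It assumes the name ascii_histogram() used in the session further on, and it swaps in collections.Counter as a stand-in for count_elements() so that the block runs on its own:

```python
from collections import Counter


def ascii_histogram(seq) -> None:
    """Print a horizontal frequency-table/histogram plot."""
    counted = Counter(seq)  # Stand-in for count_elements(seq)
    for k in sorted(counted):
        # Right-align the key in a 5-character field, then one '+' per count.
        print('{0:5d} {1}'.format(k, '+' * counted[k]))


ascii_histogram((0, 1, 1, 1, 2, 3, 7, 7, 23))
```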
This function creates a sorted frequency plot where counts are represented as tallies of plus (+) symbols. Calling sorted() on a dictionary returns a sorted list of its keys, and then you access the corresponding value for each with counted[k]. To see this in action, you can create a slightly larger dataset with Python’s random module:
>>> # No NumPy ... yet
>>> import random
>>> random.seed(1)

>>> vals = [1, 3, 4, 6, 8, 9, 10]
>>> # Each number in `vals` will occur between 5 and 15 times.
>>> freq = (random.randint(5, 15) for _ in vals)

>>> data = []
>>> for f, v in zip(freq, vals):
...     data.extend([v] * f)
...
>>> ascii_histogram(data)
    1 +++++++
    3 ++++++++++++++
    4 ++++++
    6 +++++++++
    8 ++++++
    9 ++++++++++++
   10 ++++++++++++

Here, you’re simulating plucking from vals with frequencies given by freq (a generator expression). The resulting sample data repeats each value from vals a certain number of times between 5 and 15.
Note: random.seed() is used to seed, or initialize, the underlying pseudorandom number generator (PRNG) used by random. It may sound like an oxymoron, but this is a way of making random data reproducible and deterministic. That is, if you copy the code here as is, you should get exactly the same histogram because the calls to random.randint() after seeding the generator will produce identical “random” data using the Mersenne Twister.
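To see the reproducibility for yourself, here is a small sketch using only the standard library. The variable names first and second are illustrative:

```python
import random

random.seed(1)
first = [random.randint(5, 15) for _ in range(7)]

random.seed(1)  # Re-seeding resets the generator to the same state
second = [random.randint(5, 15) for _ in range(7)]

print(first == second)  # Same seed, same sequence of "random" integers
```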
Building Up From the Base: Histogram Calculations in NumPy

Thus far, you have been working with what could best be called “frequency tables.” But mathematically, a histogram is a mapping of bins (intervals) to frequencies. More technically, it can be used to approximate the probability density function (PDF) of the underlying variable.
Moving on from the “frequency table” above, a true histogram first “bins” the range of values and then counts the number of values that fall into each bin. This is what NumPy’s histogram() function does, and it is the basis for other functions you’ll see here later in Python libraries such as Matplotlib and Pandas.
Consider a sample of floats drawn from the Laplace distribution. This distribution has fatter tails than a normal distribution and has two descriptive parameters (location and scale):
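A minimal sketch of drawing such a sample, assuming NumPy's legacy np.random interface (the one used elsewhere in this tutorial). The location of 15 and scale of 3 match the annotation in the Matplotlib example later on. The seed and the sample size of 500 are assumptions, so the values you get need not match any output printed in the text:

```python
import numpy as np

# `numpy.random` keeps its own PRNG state, separate from Python's `random`.
np.random.seed(444)  # Assumed seed, chosen only for reproducibility
np.set_printoptions(precision=3)

# loc is the location (peak) and scale controls the spread.
d = np.random.laplace(loc=15, scale=3, size=500)  # size is an assumption
print(d[:5])
```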
In this case, you’re working with a continuous distribution, and it wouldn’t be very helpful to tally each float independently, down to the umpteenth decimal place. Instead, you can bin or “bucket” the data and count the observations that fall into each bin. The histogram is the resulting count of values within each bin:
>>> hist, bin_edges = np.histogram(d)

>>> hist
array([ 1,  0,  3,  4,  4, 10, 13,  9,  2,  4])

>>> bin_edges
array([ 3.217,  5.199,  7.181,  9.163, 11.145, 13.127, 15.109, 17.091,
       19.073, 21.055, 23.037])

This result may not be immediately intuitive. np.histogram() by default uses 10 equally sized bins and returns a tuple of the frequency counts and corresponding bin edges. They are edges in the sense that there will be one more bin edge than there are members of the histogram:
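A quick check of that relationship, sketched on an assumed Laplace sample (the seed and size are illustrative):

```python
import numpy as np

np.random.seed(444)  # Assumed seed
d = np.random.laplace(loc=15, scale=3, size=500)  # Assumed sample

hist, bin_edges = np.histogram(d)  # Default: 10 equal-width bins
print(hist.size, bin_edges.size)
```

Ten bins need eleven edges: one left edge per bin plus the right edge of the last bin.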
Technical Detail: All but the last (rightmost) bin are half-open. That is, all bins but the last are [inclusive, exclusive), and the final bin is [inclusive, inclusive].
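The edge behavior can be seen on a tiny example, sketched here with hand-picked values and explicit bin edges:

```python
import numpy as np

values = [1, 2, 3]
counts, edges = np.histogram(values, bins=[1, 2, 3])

# Bin 0 is [1, 2): it holds only the value 1.
# Bin 1 is [2, 3]: as the last bin it is closed on the right,
# so it holds both 2 and 3.
print(counts)
```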
A very condensed breakdown of how the bins are constructed by NumPy looks like this:
>>> # The leftmost and rightmost bin edges
>>> # (`a` is the tuple from earlier, so use the built-in min() and max())
>>> first_edge, last_edge = min(a), max(a)

>>> n_equal_bins = 10  # NumPy's default
>>> bin_edges = np.linspace(start=first_edge, stop=last_edge,
...                         num=n_equal_bins + 1, endpoint=True)
...
>>> bin_edges
array([ 0. ,  2.3,  4.6,  6.9,  9.2, 11.5, 13.8, 16.1, 18.4, 20.7, 23. ])

The case above makes a lot of sense: 10 equally spaced bins over a peak-to-peak range of 23 means intervals of width 2.3.
From there, the function delegates to either np.bincount() or np.searchsorted(). bincount() itself can be used to effectively construct the “frequency table” that you started off with here, with the distinction that values with zero occurrences are included:
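A sketch of that comparison, assuming the tuple a from the pure-Python section. The unit-width bin edges passed to np.histogram() are a choice made here so that each bin lines up with exactly one integer:

```python
import numpy as np

a = (0, 1, 1, 1, 2, 3, 7, 7, 23)

# bincount() returns one count per integer from 0 through max(a).
bcounts = np.bincount(a)

# The same tally via histogram(), using unit-width bin edges 0, 1, ..., 24.
hist, _ = np.histogram(a, bins=np.arange(max(a) + 2))
print(np.array_equal(hist, bcounts))

# Dropping the zero counts recovers the earlier frequency table.
table = dict(zip(np.unique(a).tolist(), bcounts[bcounts.nonzero()].tolist()))
print(table)
```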
Note: hist here is really using bins of width 1.0 rather than “discrete” counts. Hence, this only works for counting integers, not floats such as [3.9, 4.1, 4.15].
Visualizing Histograms with Matplotlib and Pandas

Now that you’ve seen how to build a histogram in Python from the ground up, let’s see how other Python packages can do the job for you. Matplotlib provides the functionality to visualize Python histograms out of the box with a versatile wrapper around NumPy’s histogram():
import matplotlib.pyplot as plt

# An "interface" to matplotlib.axes.Axes.hist() method
n, bins, patches = plt.hist(x=d, bins='auto', color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.grid(axis='y', alpha=0.75)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('My Very Own Histogram')
plt.text(23, 45, r'$\mu=15, b=3$')
maxfreq = n.max()
# Set a clean upper y-axis limit.
plt.ylim(ymax=np.ceil(maxfreq / 10) * 10 if maxfreq % 10 else maxfreq + 10)



[Figure: Matplotlib histogram titled “My Very Own Histogram”, with “Value” on the x-axis and “Frequency” on the y-axis]

As defined earlier, a plot of a histogram uses its bin edges on the x-axis and the corresponding frequencies on the y-axis. In the chart above, passing bins='auto' chooses between two algorithms to estimate the “ideal” number of bins. At a high level, the goal of the algorithm is to choose a bin width that generates the most faithful representation of the data. For more on this subject, which can get pretty technical, check out Choosing Histogram Bins from the Astropy docs.
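To see what bins='auto' decides for a given sample, you can ask NumPy for the bin edges directly. This is a sketch on an assumed Laplace sample (seed and size are illustrative). According to the NumPy documentation, the two estimators that 'auto' chooses between are 'sturges' and 'fd' (Freedman-Diaconis):

```python
import numpy as np

np.random.seed(444)  # Assumed seed
d = np.random.laplace(loc=15, scale=3, size=500)  # Assumed sample

# histogram_bin_edges() computes the edges without counting anything.
edges_auto = np.histogram_bin_edges(d, bins='auto')
edges_sturges = np.histogram_bin_edges(d, bins='sturges')
edges_fd = np.histogram_bin_edges(d, bins='fd')

for name, edges in [('auto', edges_auto), ('sturges', edges_sturges),
                    ('fd', edges_fd)]:
    print(name, len(edges) - 1, 'bins')
```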
Staying in Python’s scientific stack, Pandas’ Series.plot.hist() uses matplotlib.pyplot.hist() to draw a Matplotlib histogram of the input Series:
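A minimal sketch of such a call. The commute-time data is hypothetical and generated only for illustration, and the seed, bin count, and styling arguments are assumptions:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: right-skewed commute times for 1,000 commuters.
np.random.seed(444)  # Assumed seed
commutes = pd.Series(np.random.gamma(10, size=1000) ** 1.5)

# Extra keyword arguments such as rwidth are passed through to Matplotlib.
ax = commutes.plot.hist(grid=True, bins=20, rwidth=0.9, color='#607c8e')
ax.set_title('Commute Times for 1,000 Commuters')
ax.set_xlabel('Commute Time')
ax.set_ylabel('Frequency')
```

The method returns a Matplotlib Axes object, so any further styling uses the regular Matplotlib API.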

[Figure: histogram drawn with Pandas]


pandas.DataFrame.plot.hist() is similar but produces a histogram for each column of data in the DataFrame.
Plotting a Kernel Density Estimate (KDE)

In this tutorial, you’ve been working with samples, statistically speaking. Whether the data is discrete or continuous, it’s assumed to be derived from a population that has a true, exact distribution described by just a few parameters.
A kernel density estimation (KDE) is a way to estimate the probability density function (PDF) of the random variable that “underlies” our sample. KDE is a means of data smoothing.
Sticking with the Pandas library, you can create and overlay density plots using plot.kde(), which is available for both Series and DataFrame objects. But first, let’s generate two distinct data samples for comparison:
>>> import numpy as np
>>> import pandas as pd

>>> # Sample from two different normal distributions
>>> means = 10, 20
>>> stdevs = 4, 2
>>> dist = pd.DataFrame(
...     np.random.normal(loc=means, scale=stdevs, size=(1000, 2)),
...     columns=['a', 'b'])
>>> dist.agg(['min', 'max', 'mean', 'std']).round(decimals=2)
          a      b
min   -1.57  12.46
max   25.32  26.44
mean  10.12  19.94
std    3.94   1.94

Now, to plot each histogram on the same Matplotlib axes:
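One way to do this is sketched here. It rebuilds the two-column DataFrame so that the block runs on its own. The seed, title, and colors are assumptions, and plot.kde() needs SciPy to be installed:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(444)  # Assumed seed
dist = pd.DataFrame(
    np.random.normal(loc=(10, 20), scale=(4, 2), size=(1000, 2)),
    columns=['a', 'b'])

fig, ax = plt.subplots()
# One KDE curve per column, drawn on the shared axes.
dist.plot.kde(ax=ax, legend=False, title='Histogram: A vs. B')
# density=True rescales the bars so they are comparable to the KDE curves.
dist.plot.hist(density=True, ax=ax)
ax.set_ylabel('Density')
ax.grid(axis='y')
ax.set_facecolor('#d8dcd6')
```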

[Figure: histograms of columns a and b with KDE curves overlaid]


These methods leverage SciPy’s gaussian_kde(), which results in a smoother-looking PDF.
If you take a closer look at this function, you can see how well it approximates the “true” PDF for a relatively small sample of 1000 data points. Below, you can first build the “analytical” distribution with scipy.stats.norm(). This is a class instance that encapsulates the statistical standard normal distribution, its moments, and descriptive functions. Its PDF is “exact” in the sense that it is defined precisely as norm.pdf(x) = exp(-x**2/2) / sqrt(2*pi).
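The closed-form expression can be checked numerically against SciPy. This sketch uses a handful of evaluation points chosen for illustration:

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 3, num=7)

# Closed-form density of the standard normal distribution.
by_hand = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# stats.norm() with no arguments is the standard normal (loc=0, scale=1).
print(np.allclose(stats.norm().pdf(x), by_hand))
```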
Building from there, you can take a random sample of 1000 datapoints from this distribution, then attempt to back into an estimation of the PDF with scipy.stats.gaussian_kde():
from scipy import stats

# An object representing the "frozen" analytical distribution
# Defaults to the standard normal distribution, N~(0, 1)
dist = stats.norm()

# Draw random samples from the population you built above.
# This is just a sample, so the mean and std. deviation should
# be close to (0, 1).
samp = dist.rvs(size=1000)

# `ppf()`: percent point function (inverse of cdf, i.e. percentiles).
x = np.linspace(start=stats.norm.ppf(0.01),
                stop=stats.norm.ppf(0.99), num=250)
gkde = stats.gaussian_kde(dataset=samp)

# `gkde.evaluate()` estimates the PDF itself.
fig, ax = plt.subplots()
ax.plot(x, dist.pdf(x), linestyle='solid', c='red', lw=3,
        alpha=0.8, label='Analytical (True) PDF')
ax.plot(x, gkde.evaluate(x), linestyle='dashed', c='black', lw=2,
        label='PDF Estimated via KDE')
ax.legend(loc='best', frameon=False)
ax.set_title('Analytical vs. Estimated PDF')
ax.set_ylabel('Probability')
ax.text(-2., 0.35, r'$f(x) = \frac{\exp(-x^2/2)}{\sqrt{2*\pi}}$',
        fontsize=12)



[Figure: “Analytical vs. Estimated PDF” plot comparing the true PDF with the KDE estimate]

This is a bigger chunk of code, so let’s take a second to touch on a few key lines:
  • SciPy’s stats subpackage lets you create Python objects that represent analytical distributions that you can sample from to create actual data. So dist = stats.norm() represents a normal continuous random variable, and you generate random numbers from it with dist.rvs().
  • To evaluate both the analytical PDF and the Gaussian KDE, you need an array x of quantiles (standard deviations above/below the mean, for a normal distribution). stats.gaussian_kde() represents an estimated PDF that you need to evaluate on an array to produce something visually meaningful in this case.
  • The last line contains some LaTeX, which integrates nicely with Matplotlib.
A Fancy Alternative with Seaborn

Let’s bring one more Python package into the mix. Seaborn has a distplot() function that plots the histogram and KDE for a univariate distribution in one step. (Seaborn 0.11 and later deprecate distplot() in favor of histplot() and displot().) Using the NumPy array d from earlier:

[Figure: Seaborn histogram of d with a KDE curve]


The call above produces a KDE. You can also fit a specific distribution to the data. This is different from a KDE: it consists of parameter estimation for generic data and a specified distribution name:
sns.distplot(d, fit=stats.laplace, kde=False)



[Figure: Seaborn histogram of d with a fitted Laplace distribution]

Again, note the slight difference. In the first case, you’re estimating some unknown PDF; in the second, you’re taking a known distribution and finding what parameters best describe it given the empirical data.
Other Tools in Pandas

In addition to its plotting tools, Pandas also offers a convenient .value_counts() method that computes a frequency table of the non-null values in a Pandas Series:
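A small sketch of the method on a hand-made Series. The values are illustrative, with one missing entry to show that nulls are skipped:

```python
import pandas as pd

s = pd.Series([3, 1, 3, None, 2, 3, 1])

# Null values are excluded; results are sorted by count, descending.
counts = s.value_counts()
print(counts)

# normalize=True returns relative frequencies instead of raw counts.
shares = s.value_counts(normalize=True)
print(shares)
```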
Elsewhere, pandas.cut() is a convenient way to bin values into arbitrary intervals. Let’s say you have some data on ages of individuals and want to bucket them sensibly:
>>> ages = pd.Series(
...     [1, 1, 3, 5, 8, 10, 12, 15, 18, 18, 19, 20, 25, 30, 40, 51, 52])
>>> bins = (0, 10, 13, 18, 21, np.inf)  # The edges
>>> labels = ('child', 'preteen', 'teen', 'military_age', 'adult')
>>> groups, _ = pd.cut(ages, bins=bins, labels=labels, retbins=True)

>>> groups.value_counts()
child           6
adult           5
teen            3
military_age    2
preteen         1
dtype: int64

>>> pd.concat((ages, groups), axis=1).rename(columns={0: 'age', 1: 'group'})
    age         group
0     1         child
1     1         child
2     3         child
3     5         child
4     8         child
5    10         child
6    12       preteen
7    15          teen
8    18          teen
9    18          teen
10   19  military_age
11   20  military_age
12   25         adult
13   30         adult
14   40         adult
15   51         adult
16   52         adult

What’s nice is that both of these operations ultimately utilize Cython code that makes them competitive on speed while maintaining their flexibility.
Alright, So Which Should I Use?

At this point, you’ve seen more than a handful of functions and methods to choose from for plotting a Python histogram. How do they compare? In short, there is no “one-size-fits-all.” Here’s a recap of the functions and methods you’ve covered thus far, all of which relate to breaking down and representing distributions in Python:
| You Have/Want To | Consider Using | Note(s) |
| --- | --- | --- |
| Clean-cut integer data housed in a data structure such as a list, tuple, or set, and you want to create a Python histogram without importing any third party libraries. | collections.Counter | This is a frequency table, so it doesn’t use the concept of binning as a “true” histogram does. |
| Large array of data, and you want to compute the “mathematical” histogram that represents bins and the corresponding frequencies. | np.histogram() and np.bincount() | See also np.digitize(). |
| Tabular data in a Pandas Series or DataFrame object. | Series.plot.hist(), DataFrame.plot.hist(), Series.value_counts(), and cut(), as well as Series.plot.kde() and DataFrame.plot.kde() | See the Pandas visualization docs for inspiration. |
| Create a highly customizable, fine-tuned plot from any data structure. | pyplot.hist(), which uses np.histogram() and is the basis for Pandas’ plotting functions | Matplotlib, and especially its object-oriented framework, is great for fine-tuning the details of a histogram. This interface can take a bit of time to master, but ultimately allows you to be very precise in how any visualization is laid out. |
| Pre-canned design and integration. | Seaborn’s distplot() | Essentially a “wrapper around a wrapper” that leverages a Matplotlib histogram internally, which in turn utilizes NumPy. |
You can also find the code snippets from this article together in one script at the Real Python materials page.
With that, good luck creating histograms in the wild. Hopefully one of the tools above will suit your needs. Whatever you do, just don’t use a pie chart.
Translated from: https://www.pybloggers.com/2018/07/python-histogram-plotting-numpy-matplotlib-pandas-seaborn/
