Version 0.16.2 (June 12, 2015)#

This is a minor bug-fix release from 0.16.1 and includes a large number of bug fixes along some new features (pipe() method), enhancements, and performance improvements.

We recommend that all users upgrade to this version.

Highlights include:

  • A new pipe method, see here

  • Documentation on how to use numba with pandas, see here

New features#

Pipe#

We’ve introduced a new method DataFrame.pipe(). As suggested by the name, pipe should be used to pipe data through a chain of function calls. The goal is to avoid confusing nested function calls like

# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)  # noqa F821

The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as

(
    df.pipe(h)  # noqa F821
    .pipe(g, arg1=1)  # noqa F821
    .pipe(f, arg2=2, arg3=3)  # noqa F821
)

Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.

In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument. When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple of (function, keyword) indicating where the DataFrame should flow. For example:

In [1]: import statsmodels.formula.api as sm

In [2]: bb = pd.read_csv("data/baseball.csv", index_col="id")

# sm.ols takes (formula, data)
In [3]: (
   ...:     bb.query("h > 0")
   ...:     .assign(ln_h=lambda df: np.log(df.h))
   ...:     .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
   ...:     .fit()
   ...:     .summary()
   ...: )
   ...: 
Out[3]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Sat, 02 Apr 2022   Prob (F-statistic):           3.48e-15
Time:                        08:20:33   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

The pipe method is inspired by unix pipes, which stream text through processes. More recently dplyr and magrittr have introduced the popular (%>%) pipe operator for R.

See the documentation for more. (GH10129)

Other enhancements#

  • Added rsplit to Index/Series StringMethods (GH10303)

  • Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH10231).

    Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options, see here here.

  • axis parameter of DataFrame.quantile now accepts also index and column. (GH9543)

API changes#

  • Holiday now raises NotImplementedError if both offset and observance are used in the constructor instead of returning an incorrect result (GH10217).

Performance improvements#

  • Improved Series.resample performance with dtype=datetime64[ns] (GH7754)

  • Increase performance of str.split when expand=True (GH10081)

Bug fixes#

  • Bug in Series.hist raises an error when a one row Series was given (GH10214)

  • Bug where HDFStore.select modifies the passed columns list (GH7212)

  • Bug in Categorical repr with display.width of None in Python 3 (GH10087)

  • Bug in to_json with certain orients and a CategoricalIndex would segfault (GH10317)

  • Bug where some of the nan functions do not have consistent return dtypes (GH10251)

  • Bug in DataFrame.quantile on checking that a valid axis was passed (GH9543)

  • Bug in groupby.apply aggregation for Categorical not preserving categories (GH10138)

  • Bug in to_csv where date_format is ignored if the datetime is fractional (GH10209)

  • Bug in DataFrame.to_json with mixed data types (GH10289)

  • Bug in cache updating when consolidating (GH10264)

  • Bug in mean() where integer dtypes can overflow (GH10172)

  • Bug where Panel.from_dict does not set dtype when specified (GH10058)

  • Bug in Index.union raises AttributeError when passing array-likes. (GH10149)

  • Bug in Timestamp’s’ microsecond, quarter, dayofyear, week and daysinmonth properties return np.int type, not built-in int. (GH10050)

  • Bug in NaT raises AttributeError when accessing to daysinmonth, dayofweek properties. (GH10096)

  • Bug in Index repr when using the max_seq_items=None setting (GH10182).

  • Bug in getting timezone data with dateutil on various platforms ( GH9059, GH8639, GH9663, GH10121)

  • Bug in displaying datetimes with mixed frequencies; display ‘ms’ datetimes to the proper precision. (GH10170)

  • Bug in setitem where type promotion is applied to the entire block (GH10280)

  • Bug in Series arithmetic methods may incorrectly hold names (GH10068)

  • Bug in GroupBy.get_group when grouping on multiple keys, one of which is categorical. (GH10132)

  • Bug in DatetimeIndex and TimedeltaIndex names are lost after timedelta arithmetic ( GH9926)

  • Bug in DataFrame construction from nested dict with datetime64 (GH10160)

  • Bug in Series construction from dict with datetime64 keys (GH9456)

  • Bug in Series.plot(label="LABEL") not correctly setting the label (GH10119)

  • Bug in plot not defaulting to matplotlib axes.grid setting (GH9792)

  • Bug causing strings containing an exponent, but no decimal to be parsed as int instead of float in engine='python' for the read_csv parser (GH9565)

  • Bug in Series.align resets name when fill_value is specified (GH10067)

  • Bug in read_csv causing index name not to be set on an empty DataFrame (GH10184)

  • Bug in SparseSeries.abs resets name (GH10241)

  • Bug in TimedeltaIndex slicing may reset freq (GH10292)

  • Bug in GroupBy.get_group raises ValueError when group key contains NaT (GH6992)

  • Bug in SparseSeries constructor ignores input data name (GH10258)

  • Bug in Categorical.remove_categories causing a ValueError when removing the NaN category if underlying dtype is floating-point (GH10156)

  • Bug where infer_freq infers time rule (WOM-5XXX) unsupported by to_offset (GH9425)

  • Bug in DataFrame.to_hdf() where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH9057)

  • Bug to handle masking empty DataFrame (GH10126).

  • Bug where MySQL interface could not handle numeric table/column names (GH10255)

  • Bug in read_csv with a date_parser that returned a datetime64 array of other time resolution than [ns] (GH10245)

  • Bug in Panel.apply when the result has ndim=0 (GH10332)

  • Bug in read_hdf where auto_close could not be passed (GH9327).

  • Bug in read_hdf where open stores could not be used (GH10330).

  • Bug in adding empty DataFrames, now results in a DataFrame that .equals an empty DataFrame (GH10181).

  • Bug in to_hdf and HDFStore which did not check that complib choices were valid (GH4582, GH8874).

Contributors#

A total of 34 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.

  • Andrew Rosenfeld

  • Artemy Kolchinsky

  • Bernard Willers +

  • Christer van der Meeren

  • Christian Hudon +

  • Constantine Glen Evans +

  • Daniel Julius Lasiman +

  • Evan Wright

  • Francesco Brundu +

  • Gaëtan de Menten +

  • Jake VanderPlas

  • James Hiebert +

  • Jeff Reback

  • Joris Van den Bossche

  • Justin Lecher +

  • Ka Wo Chen +

  • Kevin Sheppard

  • Mortada Mehyar

  • Morton Fox +

  • Robin Wilson +

  • Sinhrks

  • Stephan Hoyer

  • Thomas Grainger

  • Tom Ajamian

  • Tom Augspurger

  • Yoshiki Vázquez Baeza

  • Younggun Kim

  • austinc +

  • behzad nouri

  • jreback

  • lexual

  • rekcahpassyla +

  • scls19fr

  • sinhrks