Version 0.7.0 (February 9, 2012)#
New features#
- New unified merge function for efficiently performing full gamut of database / relational-algebra operations. Refactored existing join methods to use the new infrastructure, resulting in substantial performance gains (GH220, GH249, GH267) 
- New unified concatenation function for concatenating Series, DataFrame or Panel objects along an axis. Can form union or intersection of the other axes. Improves performance of - Series.appendand- DataFrame.append(GH468, GH479, GH273)
- Can pass multiple DataFrames to - DataFrame.appendto concatenate (stack) and multiple Series to- Series.appendtoo
- Can pass list of dicts (e.g., a list of JSON objects) to DataFrame constructor (GH526) 
- You can now set multiple columns in a DataFrame via - __getitem__, useful for transformation (GH342)
- Handle differently-indexed output values in - DataFrame.apply(GH498)
In [1]: df = pd.DataFrame(np.random.randn(10, 4))
In [2]: df.apply(lambda x: x.describe())
Out[2]:
               0          1          2          3
count  10.000000  10.000000  10.000000  10.000000
mean    0.190912  -0.395125  -0.731920  -0.403130
std     0.730951   0.813266   1.112016   0.961912
min    -0.861849  -2.104569  -1.776904  -1.469388
25%    -0.411391  -0.698728  -1.501401  -1.076610
50%     0.380863  -0.228039  -1.191943  -1.004091
75%     0.658444   0.057974  -0.034326   0.461706
max     1.212112   0.577046   1.643563   1.071804
[8 rows x 4 columns]
- Add - DataFrame.iterrowsmethod for efficiently iterating through the rows of a DataFrame
- Add - DataFrame.to_panelwith code adapted from- LongPanel.to_long
- Add - reindex_axismethod added to DataFrame
- Add - leveloption to binary arithmetic functions on- DataFrameand- Series
- Add - leveloption to the- reindexand- alignmethods on Series and DataFrame for broadcasting values across a level (GH542, GH552, others)
- Add attribute-based item access to - Paneland add IPython completion (GH563)
- Add - logyoption to- Series.plotfor log-scaling on the Y axis
- Add - indexand- headeroptions to- DataFrame.to_string
- Can pass multiple DataFrames to - DataFrame.jointo join on index (GH115)
- Added - justifyargument to- DataFrame.to_stringto allow different alignment of column headers
- Add - sortoption to GroupBy to allow disabling sorting of the group keys for potential speedups (GH595)
- Add Panel item access via attributes and IPython completion (GH554) 
- Implement - DataFrame.lookup, fancy-indexing analogue for retrieving values given a sequence of row and column labels (GH338)
- Can pass a list of functions to aggregate with groupby on a DataFrame, yielding an aggregated result with hierarchical columns (GH166) 
- Can call - cumminand- cummaxon Series and DataFrame to get cumulative minimum and maximum, respectively (GH647)
- value_rangeadded as utility function to get min and max of a dataframe (GH288)
- Added - encodingargument to- read_csv,- read_table,- to_csvand- from_csvfor non-ascii text (GH717)
- Added - absmethod to pandas objects
- Added - crosstabfunction for easily computing frequency tables
- Added - isinmethod to index objects
- Added - levelargument to- xsmethod of DataFrame.
API changes to integer indexing#
One of the potentially riskiest API changes in 0.7.0, but also one of the most important, was a complete review of how integer indexes are handled with regard to label-based indexing. Here is an example:
In [3]: s = pd.Series(np.random.randn(10), index=range(0, 20, 2))
In [4]: s
Out[4]:
0    -1.294524
2     0.413738
4     0.276662
6    -0.472035
8    -0.013960
10   -0.362543
12   -0.006154
14   -0.923061
16    0.895717
18    0.805244
Length: 10, dtype: float64
In [5]: s[0]
Out[5]: -1.2945235902555294
In [6]: s[2]
Out[6]: 0.41373810535784006
In [7]: s[4]
Out[7]: 0.2766617129497566
This is all exactly identical to the behavior before. However, if you ask for a
key not contained in the Series, in versions 0.6.1 and prior, Series would
fall back on a location-based lookup. This now raises a KeyError:
In [2]: s[1]
KeyError: 1
This change also has the same impact on DataFrame:
In [3]: df = pd.DataFrame(np.random.randn(8, 4), index=range(0, 16, 2))
In [4]: df
    0        1       2       3
0   0.88427  0.3363 -0.1787  0.03162
2   0.14451 -0.1415  0.2504  0.58374
4  -1.44779 -0.9186 -1.4996  0.27163
6  -0.26598 -2.4184 -0.2658  0.11503
8  -0.58776  0.3144 -0.8566  0.61941
10  0.10940 -0.7175 -1.0108  0.47990
12 -1.16919 -0.3087 -0.6049 -0.43544
14 -0.07337  0.3410  0.0424 -0.16037
In [5]: df.ix[3]
KeyError: 3
In order to support purely integer-based indexing, the following methods have been added:
| Method | Description | 
|---|---|
| 
 | Retrieve value stored at location  | 
| 
 | Alias for  | 
| 
 | Retrieve the  | 
| 
 | Retrieve the  | 
| 
 | Retrieve the value at row  | 
API tweaks regarding label-based slicing#
Label-based slicing using ix now requires that the index be sorted
(monotonic) unless both the start and endpoint are contained in the index:
In [1]: s = pd.Series(np.random.randn(6), index=list('gmkaec'))
In [2]: s
Out[2]:
g   -1.182230
m   -0.276183
k   -0.243550
a    1.628992
e    0.073308
c   -0.539890
dtype: float64
Then this is OK:
In [3]: s.ix['k':'e']
Out[3]:
k   -0.243550
a    1.628992
e    0.073308
dtype: float64
But this is not:
In [12]: s.ix['b':'h']
KeyError 'b'
If the index had been sorted, the “range selection” would have been possible:
In [4]: s2 = s.sort_index()
In [5]: s2
Out[5]:
a    1.628992
c   -0.539890
e    0.073308
g   -1.182230
k   -0.243550
m   -0.276183
dtype: float64
In [6]: s2.ix['b':'h']
Out[6]:
c   -0.539890
e    0.073308
g   -1.182230
dtype: float64
Changes to Series [] operator#
As as notational convenience, you can pass a sequence of labels or a label
slice to a Series when getting and setting values via [] (i.e. the
__getitem__ and __setitem__ methods). The behavior will be the same as
passing similar input to ix except in the case of integer indexing:
In [8]: s = pd.Series(np.random.randn(6), index=list('acegkm'))
In [9]: s
Out[9]:
a   -1.206412
c    2.565646
e    1.431256
g    1.340309
k   -1.170299
m   -0.226169
Length: 6, dtype: float64
In [10]: s[['m', 'a', 'c', 'e']]
Out[10]:
m   -0.226169
a   -1.206412
c    2.565646
e    1.431256
Length: 4, dtype: float64
In [11]: s['b':'l']
Out[11]:
c    2.565646
e    1.431256
g    1.340309
k   -1.170299
Length: 4, dtype: float64
In [12]: s['c':'k']
Out[12]:
c    2.565646
e    1.431256
g    1.340309
k   -1.170299
Length: 4, dtype: float64
In the case of integer indexes, the behavior will be exactly as before
(shadowing ndarray):
In [13]: s = pd.Series(np.random.randn(6), index=range(0, 12, 2))
In [14]: s[[4, 0, 2]]
Out[14]:
4    0.132003
0    0.410835
2    0.813850
Length: 3, dtype: float64
In [15]: s[1:5]
Out[15]:
2    0.813850
4    0.132003
6   -0.827317
8   -0.076467
Length: 4, dtype: float64
If you wish to do indexing with sequences and slicing on an integer index with
label semantics, use ix.
Other API changes#
- The deprecated - LongPanelclass has been completely removed
- If - Series.sortis called on a column of a DataFrame, an exception will now be raised. Before it was possible to accidentally mutate a DataFrame’s column by doing- df[col].sort()instead of the side-effect free method- df[col].order()(GH316)
- Miscellaneous renames and deprecations which will (harmlessly) raise - FutureWarning
- dropadded as an optional parameter to- DataFrame.reset_index(GH699)
Performance improvements#
- Cythonized GroupBy aggregations no longer presort the data, thus achieving a significant speedup (GH93). GroupBy aggregations with Python functions significantly sped up by clever manipulation of the ndarray data type in Cython (GH496). 
- Better error message in DataFrame constructor when passed column labels don’t match data (GH497) 
- Substantially improve performance of multi-GroupBy aggregation when a Python function is passed, reuse ndarray object in Cython (GH496) 
- Can store objects indexed by tuples and floats in HDFStore (GH492) 
- Don’t print length by default in Series.to_string, add - lengthoption (GH489)
- Improve Cython code for multi-groupby to aggregate without having to sort the data (GH93) 
- Improve MultiIndex reindexing speed by storing tuples in the MultiIndex, test for backwards unpickling compatibility 
- Improve column reindexing performance by using specialized Cython take function 
- Further performance tweaking of Series.__getitem__ for standard use cases 
- Avoid Index dict creation in some cases (i.e. when getting slices, etc.), regression from prior versions 
- Friendlier error message in setup.py if NumPy not installed 
- Use common set of NA-handling operations (sum, mean, etc.) in Panel class also (GH536) 
- Default name assignment when calling - reset_indexon DataFrame with a regular (non-hierarchical) index (GH476)
- Use Cythonized groupers when possible in Series/DataFrame stat ops with - levelparameter passed (GH545)
- Ported skiplist data structure to C to speed up - rolling_medianby about 5-10x in most typical use cases (GH374)
Contributors#
A total of 18 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.
- Adam Klein 
- Bayle Shanks + 
- Chris Billington + 
- Dieter Vandenbussche 
- Fabrizio Pollastri + 
- Graham Taylor + 
- Gregg Lind + 
- Josh Klein + 
- Luca Beltrame 
- Olivier Grisel + 
- Skipper Seabold 
- Thomas Kluyver 
- Thomas Wiecki + 
- Wes McKinney 
- Wouter Overmeire 
- Yaroslav Halchenko 
- fabriziop + 
- theandygross +