API¶
Dataframe¶
DataFrame (dsk, name, meta, divisions) |
Parallel Pandas DataFrame |
DataFrame.add (other[, axis, level, fill_value]) |
Addition of dataframe and other, element-wise (binary operator add). |
DataFrame.append (other) |
Append rows of other to the end of this frame, returning a new object. |
DataFrame.apply (func[, axis, broadcast, …]) |
Parallel version of pandas.DataFrame.apply |
DataFrame.assign (**kwargs) |
Assign new columns to a DataFrame, returning a new object (a copy) with the new columns added to the original ones. |
DataFrame.astype (dtype) |
Cast a pandas object to a specified dtype dtype . |
DataFrame.categorize ([columns, index, …]) |
Convert columns of the DataFrame to category dtype. |
DataFrame.columns |
|
DataFrame.compute (**kwargs) |
Compute this dask collection |
DataFrame.corr ([method, min_periods, …]) |
Compute pairwise correlation of columns, excluding NA/null values |
DataFrame.count ([axis, split_every]) |
Count non-NA cells for each column or row. |
DataFrame.cov ([min_periods, split_every]) |
Compute pairwise covariance of columns, excluding NA/null values. |
DataFrame.cummax ([axis, skipna, out]) |
Return cumulative maximum over a DataFrame or Series axis. |
DataFrame.cummin ([axis, skipna, out]) |
Return cumulative minimum over a DataFrame or Series axis. |
DataFrame.cumprod ([axis, skipna, dtype, out]) |
Return cumulative product over a DataFrame or Series axis. |
DataFrame.cumsum ([axis, skipna, dtype, out]) |
Return cumulative sum over a DataFrame or Series axis. |
DataFrame.describe ([split_every, percentiles]) |
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. |
DataFrame.div (other[, axis, level, fill_value]) |
Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.drop (labels[, axis, errors]) |
Drop specified labels from rows or columns. |
DataFrame.drop_duplicates ([split_every, …]) |
Return DataFrame with duplicate rows removed, optionally only considering certain columns |
DataFrame.dropna ([how, subset]) |
Remove missing values. |
DataFrame.dtypes |
Return data types |
DataFrame.fillna ([value, method, limit, axis]) |
Fill NA/NaN values using the specified method |
DataFrame.floordiv (other[, axis, level, …]) |
Integer division of dataframe and other, element-wise (binary operator floordiv). |
DataFrame.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
DataFrame.groupby ([by]) |
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns. |
DataFrame.head ([n, npartitions, compute]) |
First n rows of the dataset |
DataFrame.iloc |
Purely integer-location based indexing for selection by position. |
DataFrame.index |
Return dask Index instance |
DataFrame.isna () |
Detect missing values. |
DataFrame.isnull () |
Detect missing values. |
DataFrame.iterrows () |
Iterate over DataFrame rows as (index, Series) pairs. |
DataFrame.itertuples () |
Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple. |
DataFrame.join (other[, on, how, lsuffix, …]) |
Join columns with other DataFrame either on index or on a key column. |
DataFrame.known_divisions |
Whether divisions are already known |
DataFrame.loc |
Purely label-location based indexer for selection by label. |
DataFrame.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
DataFrame.mask (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other. |
DataFrame.max ([axis, skipna, split_every, out]) |
This method returns the maximum of the values in the object. |
DataFrame.mean ([axis, skipna, split_every, …]) |
Return the mean of the values for the requested axis |
DataFrame.merge (right[, how, on, left_on, …]) |
Merge DataFrame objects by performing a database-style join operation by columns or indexes. |
DataFrame.min ([axis, skipna, split_every, out]) |
This method returns the minimum of the values in the object. |
DataFrame.mod (other[, axis, level, fill_value]) |
Modulo of dataframe and other, element-wise (binary operator mod). |
DataFrame.mul (other[, axis, level, fill_value]) |
Multiplication of dataframe and other, element-wise (binary operator mul). |
DataFrame.ndim |
Return dimensionality |
DataFrame.nlargest ([n, columns, split_every]) |
Return the first n rows ordered by columns in descending order. |
DataFrame.npartitions |
Return number of partitions |
DataFrame.partitions |
Slice dataframe by partitions |
DataFrame.pow (other[, axis, level, fill_value]) |
Exponential power of dataframe and other, element-wise (binary operator pow). |
DataFrame.quantile ([q, axis]) |
Approximate row-wise and precise column-wise quantiles of DataFrame |
DataFrame.query (expr, **kwargs) |
Filter dataframe with complex expression |
DataFrame.radd (other[, axis, level, fill_value]) |
Addition of dataframe and other, element-wise (binary operator radd). |
DataFrame.random_split (frac[, random_state]) |
Pseudorandomly split dataframe into different pieces row-wise |
DataFrame.rdiv (other[, axis, level, fill_value]) |
Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.rename ([index, columns]) |
Alter axes labels. |
DataFrame.repartition ([divisions, …]) |
Repartition dataframe along new divisions |
DataFrame.reset_index ([drop]) |
Reset the index to the default index. |
DataFrame.rfloordiv (other[, axis, level, …]) |
Integer division of dataframe and other, element-wise (binary operator rfloordiv). |
DataFrame.rmod (other[, axis, level, fill_value]) |
Modulo of dataframe and other, element-wise (binary operator rmod). |
DataFrame.rmul (other[, axis, level, fill_value]) |
Multiplication of dataframe and other, element-wise (binary operator rmul). |
DataFrame.rpow (other[, axis, level, fill_value]) |
Exponential power of dataframe and other, element-wise (binary operator rpow). |
DataFrame.rsub (other[, axis, level, fill_value]) |
Subtraction of dataframe and other, element-wise (binary operator rsub). |
DataFrame.rtruediv (other[, axis, level, …]) |
Floating division of dataframe and other, element-wise (binary operator rtruediv). |
DataFrame.sample ([n, frac, replace, …]) |
Random sample of items |
DataFrame.set_index (other[, drop, sorted, …]) |
Set the DataFrame index (row labels) using an existing column |
DataFrame.shape |
Return a tuple representing the dimensionality of the DataFrame. |
DataFrame.std ([axis, skipna, ddof, …]) |
Return sample standard deviation over requested axis. |
DataFrame.sub (other[, axis, level, fill_value]) |
Subtraction of dataframe and other, element-wise (binary operator sub). |
DataFrame.sum ([axis, skipna, split_every, …]) |
Return the sum of the values for the requested axis |
DataFrame.tail ([n, compute]) |
Last n rows of the dataset |
DataFrame.to_bag ([index]) |
Create Dask Bag from a Dask DataFrame |
DataFrame.to_csv (filename, **kwargs) |
Store Dask DataFrame to CSV files |
DataFrame.to_dask_array ([lengths]) |
Convert a dask DataFrame to a dask array. |
DataFrame.to_delayed ([optimize_graph]) |
Convert into a list of dask.delayed objects, one per partition. |
DataFrame.to_hdf (path_or_buf, key[, mode, …]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
DataFrame.to_json (filename, *args, **kwargs) |
See dd.to_json docstring for more information |
DataFrame.to_parquet (path, *args, **kwargs) |
Store Dask.dataframe to Parquet files |
DataFrame.to_records ([index]) |
Create Dask Array from a Dask Dataframe |
DataFrame.truediv (other[, axis, level, …]) |
Floating division of dataframe and other, element-wise (binary operator truediv). |
DataFrame.values |
Return a dask.array of the values of this dataframe |
DataFrame.var ([axis, skipna, ddof, …]) |
Return unbiased variance over requested axis. |
DataFrame.visualize ([filename, format, …]) |
Render the computation of this object’s task graph using graphviz. |
DataFrame.where (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other. |
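Most of the methods listed above mirror their pandas counterparts and stay lazy until computed. A minimal sketch, assuming a toy frame built with dd.from_pandas (the column names x and y are illustrative):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3, 4],
...                                    'y': [1., 2., 3., 4.]}), npartitions=2)
>>> ddf.head(2)                    # eager: reads only from the first partition
>>> total = (ddf.x + ddf.y).sum()  # lazy: builds a task graph
>>> total.compute()  # doctest: +SKIP
20.0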
Series¶
Series (dsk, name, meta, divisions) |
Parallel Pandas Series |
Series.add (other[, level, fill_value, axis]) |
Addition of series and other, element-wise (binary operator add). |
Series.align (other[, join, axis, fill_value]) |
Align two objects on their axes with the specified join method for each axis Index |
Series.all ([axis, skipna, split_every, out]) |
Return whether all elements are True, potentially over an axis. |
Series.any ([axis, skipna, split_every, out]) |
Return whether any element is True over requested axis. |
Series.append (other) |
Concatenate two or more Series. |
Series.apply (func[, convert_dtype, meta, args]) |
Parallel version of pandas.Series.apply |
Series.astype (dtype) |
Cast a pandas object to a specified dtype dtype . |
Series.autocorr ([lag, split_every]) |
Lag-N autocorrelation |
Series.between (left, right[, inclusive]) |
Return boolean Series equivalent to left <= series <= right. |
Series.bfill ([axis, limit]) |
Synonym for DataFrame.fillna(method='bfill') |
Series.cat |
|
Series.clear_divisions () |
Forget division information |
Series.clip ([lower, upper, out]) |
Trim values at input threshold(s). |
Series.clip_lower (threshold) |
Return copy of the input with values below a threshold truncated. |
Series.clip_upper (threshold) |
Return copy of input with values above given value(s) truncated. |
Series.compute (**kwargs) |
Compute this dask collection |
Series.copy () |
Make a copy of the dataframe |
Series.corr (other[, method, min_periods, …]) |
Compute correlation with other Series, excluding missing values |
Series.count ([split_every]) |
Return number of non-NA/null observations in the Series |
Series.cov (other[, min_periods, split_every]) |
Compute covariance with Series, excluding missing values |
Series.cummax ([axis, skipna, out]) |
Return cumulative maximum over a DataFrame or Series axis. |
Series.cummin ([axis, skipna, out]) |
Return cumulative minimum over a DataFrame or Series axis. |
Series.cumprod ([axis, skipna, dtype, out]) |
Return cumulative product over a DataFrame or Series axis. |
Series.cumsum ([axis, skipna, dtype, out]) |
Return cumulative sum over a DataFrame or Series axis. |
Series.describe ([split_every, percentiles]) |
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. |
Series.diff ([periods, axis]) |
First discrete difference of element. |
Series.div (other[, level, fill_value, axis]) |
Floating division of series and other, element-wise (binary operator truediv). |
Series.drop_duplicates ([split_every, split_out]) |
Return DataFrame with duplicate rows removed, optionally only considering certain columns |
Series.dropna () |
Return a new Series with missing values removed. |
Series.dt |
Namespace of datetime methods |
Series.dtype |
Return data type |
Series.eq (other[, level, axis]) |
Equal to of series and other, element-wise (binary operator eq). |
Series.ffill ([axis, limit]) |
Synonym for DataFrame.fillna(method='ffill') |
Series.fillna ([value, method, limit, axis]) |
Fill NA/NaN values using the specified method |
Series.first (offset) |
Convenience method for subsetting initial periods of time series data based on a date offset. |
Series.floordiv (other[, level, fill_value, axis]) |
Integer division of series and other, element-wise (binary operator floordiv). |
Series.ge (other[, level, axis]) |
Greater than or equal to of series and other, element-wise (binary operator ge). |
Series.get_partition (n) |
Get a dask DataFrame/Series representing the nth partition. |
Series.groupby ([by]) |
Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns. |
Series.gt (other[, level, axis]) |
Greater than of series and other, element-wise (binary operator gt). |
Series.head ([n, npartitions, compute]) |
First n rows of the dataset |
Series.idxmax ([axis, skipna, split_every]) |
Return index of first occurrence of maximum over requested axis. |
Series.idxmin ([axis, skipna, split_every]) |
Return index of first occurrence of minimum over requested axis. |
Series.isin (values) |
Check whether values are contained in Series. |
Series.isna () |
Detect missing values. |
Series.isnull () |
Detect missing values. |
Series.iteritems () |
Lazily iterate over (index, value) tuples |
Series.known_divisions |
Whether divisions are already known |
Series.last (offset) |
Convenience method for subsetting final periods of time series data based on a date offset. |
Series.le (other[, level, axis]) |
Less than or equal to of series and other, element-wise (binary operator le). |
Series.loc |
Purely label-location based indexer for selection by label. |
Series.lt (other[, level, axis]) |
Less than of series and other, element-wise (binary operator lt). |
Series.map (arg[, na_action, meta]) |
Map values of Series using input correspondence (a dict, Series, or function). |
Series.map_overlap (func, before, after, …) |
Apply a function to each partition, sharing rows with adjacent partitions. |
Series.map_partitions (func, *args, **kwargs) |
Apply Python function on each DataFrame partition. |
Series.mask (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other. |
Series.max ([axis, skipna, split_every, out]) |
This method returns the maximum of the values in the object. |
Series.mean ([axis, skipna, split_every, …]) |
Return the mean of the values for the requested axis |
Series.memory_usage ([index, deep]) |
Return the memory usage of the Series. |
Series.min ([axis, skipna, split_every, out]) |
This method returns the minimum of the values in the object. |
Series.mod (other[, level, fill_value, axis]) |
Modulo of series and other, element-wise (binary operator mod). |
Series.mul (other[, level, fill_value, axis]) |
Multiplication of series and other, element-wise (binary operator mul). |
Series.nbytes |
Number of bytes |
Series.ndim |
Return dimensionality |
Series.ne (other[, level, axis]) |
Not equal to of series and other, element-wise (binary operator ne). |
Series.nlargest ([n, split_every]) |
Return the largest n elements. |
Series.notnull () |
Detect existing (non-missing) values. |
Series.nsmallest ([n, split_every]) |
Return the smallest n elements. |
Series.nunique ([split_every]) |
Return number of unique elements in the object. |
Series.nunique_approx ([split_every]) |
Approximate number of unique rows. |
Series.persist (**kwargs) |
Persist this dask collection into memory |
Series.pipe (func, *args, **kwargs) |
Apply func(self, *args, **kwargs) |
Series.pow (other[, level, fill_value, axis]) |
Exponential power of series and other, element-wise (binary operator pow). |
Series.prod ([axis, skipna, split_every, …]) |
Return the product of the values for the requested axis |
Series.quantile ([q]) |
Approximate quantiles of Series |
Series.radd (other[, level, fill_value, axis]) |
Addition of series and other, element-wise (binary operator radd). |
Series.random_split (frac[, random_state]) |
Pseudorandomly split dataframe into different pieces row-wise |
Series.rdiv (other[, level, fill_value, axis]) |
Floating division of series and other, element-wise (binary operator rtruediv). |
Series.reduction (chunk[, aggregate, …]) |
Generic row-wise reductions. |
Series.repartition ([divisions, npartitions, …]) |
Repartition dataframe along new divisions |
Series.rename ([index, inplace, sorted_index]) |
Alter Series index labels or name |
Series.resample (rule[, closed, label]) |
Convenience method for frequency conversion and resampling of time series. |
Series.reset_index ([drop]) |
Reset the index to the default index. |
Series.rolling (window[, min_periods, freq, …]) |
Provides rolling transformations. |
Series.round ([decimals]) |
Round each value in a Series to the given number of decimals. |
Series.sample ([n, frac, replace, random_state]) |
Random sample of items |
Series.sem ([axis, skipna, ddof, split_every]) |
Return unbiased standard error of the mean over requested axis. |
Series.shape |
Return a tuple representing the dimensionality of a Series. |
Series.shift ([periods, freq, axis]) |
Shift index by desired number of periods with an optional time freq |
Series.size |
Size of the Series or DataFrame as a Delayed object. |
Series.std ([axis, skipna, ddof, …]) |
Return sample standard deviation over requested axis. |
Series.str |
Namespace for string methods |
Series.sub (other[, level, fill_value, axis]) |
Subtraction of series and other, element-wise (binary operator sub). |
Series.sum ([axis, skipna, split_every, …]) |
Return the sum of the values for the requested axis |
Series.to_bag ([index]) |
Create a Dask Bag from a Series |
Series.to_csv (filename, **kwargs) |
Store Dask DataFrame to CSV files |
Series.to_dask_array ([lengths]) |
Convert a dask DataFrame to a dask array. |
Series.to_delayed ([optimize_graph]) |
Convert into a list of dask.delayed objects, one per partition. |
Series.to_frame ([name]) |
Convert Series to DataFrame |
Series.to_hdf (path_or_buf, key[, mode, append]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
Series.to_string ([max_rows]) |
Render a string representation of the Series |
Series.to_timestamp ([freq, how, axis]) |
Cast to DatetimeIndex of timestamps, at beginning of period |
Series.truediv (other[, level, fill_value, axis]) |
Floating division of series and other, element-wise (binary operator truediv). |
Series.unique ([split_every, split_out]) |
Return Series of unique values in the object. |
Series.value_counts ([split_every, split_out]) |
Returns object containing counts of unique values. |
Series.values |
Return a dask.array of the values of this dataframe |
Series.var ([axis, skipna, ddof, …]) |
Return unbiased variance over requested axis. |
Series.visualize ([filename, format, …]) |
Render the computation of this object’s task graph using graphviz. |
Series.where (cond[, other]) |
Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other. |
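The Series methods follow the same lazy pattern. A minimal sketch combining the str accessor and value_counts, assuming the illustrative data below (result ordering may vary):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = dd.from_pandas(pd.Series(['a', 'b', 'a', 'c']), npartitions=2)
>>> s.str.upper().value_counts().compute()  # doctest: +SKIP
A    2
B    1
C    1
dtype: int64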
Groupby Operations¶
DataFrameGroupBy.aggregate (arg[, …]) |
Aggregate using one or more operations over the specified axis. |
DataFrameGroupBy.apply (func, *args, **kwargs) |
Parallel version of pandas GroupBy.apply |
DataFrameGroupBy.count ([split_every, split_out]) |
Compute count of group, excluding missing values |
DataFrameGroupBy.cumcount ([axis]) |
Number each item in each group from 0 to the length of that group - 1. |
DataFrameGroupBy.cumprod ([axis]) |
Cumulative product for each group |
DataFrameGroupBy.cumsum ([axis]) |
Cumulative sum for each group |
DataFrameGroupBy.get_group (key) |
Constructs NDFrame from group with provided name |
DataFrameGroupBy.max ([split_every, split_out]) |
Compute max of group values |
DataFrameGroupBy.mean ([split_every, split_out]) |
Compute mean of groups, excluding missing values |
DataFrameGroupBy.min ([split_every, split_out]) |
Compute min of group values |
DataFrameGroupBy.size ([split_every, split_out]) |
Compute group sizes |
DataFrameGroupBy.std ([ddof, split_every, …]) |
Compute standard deviation of groups, excluding missing values |
DataFrameGroupBy.sum ([split_every, split_out]) |
Compute sum of group values |
DataFrameGroupBy.var ([ddof, split_every, …]) |
Compute variance of groups, excluding missing values |
DataFrameGroupBy.first ([split_every, split_out]) |
Compute first of group values |
DataFrameGroupBy.last ([split_every, split_out]) |
Compute last of group values |
SeriesGroupBy.aggregate (arg[, split_every, …]) |
Aggregate using one or more operations over the specified axis. |
SeriesGroupBy.apply (func, *args, **kwargs) |
Parallel version of pandas GroupBy.apply |
SeriesGroupBy.count ([split_every, split_out]) |
Compute count of group, excluding missing values |
SeriesGroupBy.cumcount ([axis]) |
Number each item in each group from 0 to the length of that group - 1. |
SeriesGroupBy.cumprod ([axis]) |
Cumulative product for each group |
SeriesGroupBy.cumsum ([axis]) |
Cumulative sum for each group |
SeriesGroupBy.get_group (key) |
Constructs NDFrame from group with provided name |
SeriesGroupBy.max ([split_every, split_out]) |
Compute max of group values |
SeriesGroupBy.mean ([split_every, split_out]) |
Compute mean of groups, excluding missing values |
SeriesGroupBy.min ([split_every, split_out]) |
Compute min of group values |
SeriesGroupBy.nunique ([split_every, split_out]) |
|
SeriesGroupBy.size ([split_every, split_out]) |
Compute group sizes |
SeriesGroupBy.std ([ddof, split_every, split_out]) |
Compute standard deviation of groups, excluding missing values |
SeriesGroupBy.sum ([split_every, split_out]) |
Compute sum of group values |
SeriesGroupBy.var ([ddof, split_every, split_out]) |
Compute variance of groups, excluding missing values |
SeriesGroupBy.first ([split_every, split_out]) |
Compute first of group values |
SeriesGroupBy.last ([split_every, split_out]) |
Compute last of group values |
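A minimal groupby sketch, assuming the illustrative columns key and val; the split_every and split_out keywords that tune the tree reduction are left at their defaults here:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
...                                    'val': [1, 2, 3, 4]}), npartitions=2)
>>> ddf.groupby('key').val.sum().compute()  # doctest: +SKIP
key
a    4
b    6
Name: val, dtype: int64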
Rolling Operations¶
rolling.map_overlap (func, df, before, after, …) |
Apply a function to each partition, sharing rows with adjacent partitions. |
Series.rolling (window[, min_periods, freq, …]) |
Provides rolling transformations. |
DataFrame.rolling (window[, min_periods, …]) |
Provides rolling transformations. |
Rolling.apply (func[, args, kwargs]) |
rolling function apply |
Rolling.count () |
The rolling count of any non-NaN observations inside the window. |
Rolling.kurt () |
Calculate unbiased rolling kurtosis. |
Rolling.max () |
rolling maximum |
Rolling.mean () |
Calculate the rolling mean of the values. |
Rolling.median () |
Calculate the rolling median. |
Rolling.min () |
Calculate the rolling minimum. |
Rolling.quantile (quantile) |
rolling quantile. |
Rolling.skew () |
Unbiased rolling skewness |
Rolling.std ([ddof]) |
Calculate rolling standard deviation. |
Rolling.sum () |
Calculate rolling sum of given DataFrame or Series. |
Rolling.var ([ddof]) |
Calculate unbiased rolling variance. |
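A minimal rolling sketch on a datetime-indexed dask Series; the three-row window is illustrative:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> s = pd.Series(range(10), index=pd.date_range('2018-01-01', periods=10))
>>> ds = dd.from_pandas(s, npartitions=2)
>>> ds.rolling(3).mean().compute()  # doctest: +SKIP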
Create DataFrames¶
read_csv (urlpath[, blocksize, collection, …]) |
Read CSV files into a Dask.DataFrame |
read_table (urlpath[, blocksize, collection, …]) |
Read delimited files into a Dask.DataFrame |
read_parquet (path[, columns, filters, …]) |
Read ParquetFile into a Dask DataFrame |
read_hdf (pattern, key[, start, stop, …]) |
Read HDF files into a Dask DataFrame |
read_json (url_path[, orient, lines, …]) |
Create a dataframe from a set of JSON files |
read_orc (path[, columns, storage_options]) |
Read dataframe from ORC file(s) |
read_sql_table (table, uri, index_col[, …]) |
Create dataframe from an SQL table. |
from_array (x[, chunksize, columns]) |
Read any slicable array into a Dask Dataframe |
from_bcolz (x[, chunksize, categorize, …]) |
Read BColz CTable into a Dask Dataframe |
from_dask_array (x[, columns, index]) |
Create a Dask DataFrame from a Dask Array. |
from_delayed (dfs[, meta, divisions, prefix]) |
Create Dask DataFrame from many Dask Delayed objects |
from_pandas (data[, npartitions, chunksize, …]) |
Construct a Dask DataFrame from a Pandas DataFrame |
dask.bag.core.Bag.to_dataframe ([meta, columns]) |
Create Dask Dataframe from a Dask Bag. |
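A hedged sketch of two common entry points; the CSV glob path is illustrative:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.read_csv('data/2018-*.csv')  # doctest: +SKIP
>>> ddf2 = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)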
Store DataFrames¶
to_csv (df, filename[, name_function, …]) |
Store Dask DataFrame to CSV files |
to_parquet (df, path[, engine, compression, …]) |
Store Dask.dataframe to Parquet files |
to_hdf (df, path, key[, mode, append, …]) |
Store Dask Dataframe to Hierarchical Data Format (HDF) files |
to_records (df) |
Create Dask Array from a Dask Dataframe |
to_bag (df[, index]) |
Create Dask Bag from a Dask DataFrame |
to_json (df, url_path[, orient, lines, …]) |
Write dataframe into JSON text files |
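A hedged sketch of writing a collection back out; the output paths are illustrative, and one file is written per partition (the '*' in the CSV name is replaced by the partition number):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)
>>> ddf.to_parquet('output/parquet')      # doctest: +SKIP
>>> ddf.to_csv('output/csv/part-*.csv')   # doctest: +SKIP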
Convert DataFrames¶
to_dask_array |
Convert a dask DataFrame to a dask array. |
to_delayed |
Convert into a list of dask.delayed objects, one per partition. |
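A minimal sketch of both conversions on a small frame; passing lengths=True to to_dask_array computes the partition sizes so the resulting array has known chunks:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)
>>> arr = ddf.to_dask_array(lengths=True)
>>> parts = ddf.to_delayed()   # one dask.delayed object per partition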
DataFrame Methods¶
class dask.dataframe.DataFrame(dsk, name, meta, divisions)¶
Parallel Pandas DataFrame
Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.
Parameters: dsk: dict
The dask graph to compute this DataFrame
name: str
The key prefix that specifies which keys in the dask comprise this particular DataFrame
meta: pandas.DataFrame
An empty pandas.DataFrame with names, dtypes, and index matching the expected output.
divisions: tuple of index values
Values along which we partition our blocks on the index
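Since the class is not meant to be constructed directly, a minimal sketch of the usual route through dd.from_pandas:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'x': [1, 2, 3, 4]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.npartitions
2
>>> ddf.known_divisions   # divisions come from the sorted pandas index
True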
abs()¶
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns: abs
Series/DataFrame containing the absolute value of each element.
See also
numpy.absolute
- calculate the absolute value element-wise.
Notes
For complex inputs, 1.2 + 1j, the absolute value is \(\sqrt{a^2 + b^2}\).
Examples
Absolute numeric values in a Series.
>>> s = pd.Series([-1.10, 2, -3.33, 4]) # doctest: +SKIP >>> s.abs() # doctest: +SKIP 0 1.10 1 2.00 2 3.33 3 4.00 dtype: float64
Absolute numeric values in a Series with complex numbers.
>>> s = pd.Series([1.2 + 1j]) # doctest: +SKIP >>> s.abs() # doctest: +SKIP 0 1.56205 dtype: float64
Absolute numeric values in a Series with a Timedelta element.
>>> s = pd.Series([pd.Timedelta('1 days')]) # doctest: +SKIP >>> s.abs() # doctest: +SKIP 0 1 days dtype: timedelta64[ns]
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({ # doctest: +SKIP ... 'a': [4, 5, 6, 7], ... 'b': [10, 20, 30, 40], ... 'c': [100, 50, -30, -50] ... }) >>> df # doctest: +SKIP a b c 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 >>> df.loc[(df.c - 43).abs().argsort()] # doctest: +SKIP a b c 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50
add(other, axis='columns', level=None, fill_value=None)¶
Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
>>> a = pd.DataFrame([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], ... columns=['one']) >>> a one a 1.0 b 1.0 c 1.0 d NaN >>> b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], ... two=[np.nan, 2, np.nan, 2]), ... index=['a', 'b', 'd', 'e']) >>> b one two a 1.0 NaN b NaN 2.0 d 1.0 NaN e NaN 2.0 >>> a.add(b, fill_value=0) one two a 2.0 NaN b 1.0 2.0 c 1.0 NaN d 1.0 NaN e NaN 2.0
align(other, join='outer', axis=None, fill_value=None)¶
Align two objects on their axes with the specified join method for each axis Index
Parameters: other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value
method : str, default None
limit : int, default None
fill_axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Filling axis, method and limit
broadcast_axis : {0 or ‘index’, 1 or ‘columns’}, default None
Broadcast values along this axis, if aligning two objects of different dimensions
Returns: (left, right) : (DataFrame, type of other)
Aligned objects
all(axis=None, skipna=True, split_every=False, out=None)¶
Return whether all elements are True, potentially over an axis.
Returns True if all elements within a series or along a Dataframe axis are non-zero, not-empty or not-False.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
**kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: all : Series or DataFrame (if level specified)
See also
pandas.Series.all
- Return True if all elements are True
pandas.DataFrame.any
- Return True if one (or more) elements are True
Examples
Series
>>> pd.Series([True, True]).all() # doctest: +SKIP True >>> pd.Series([True, False]).all() # doctest: +SKIP False
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) # doctest: +SKIP >>> df # doctest: +SKIP col1 col2 0 True True 1 True False
Default behaviour checks if column-wise values all return True.
>>> df.all() # doctest: +SKIP col1 True col2 False dtype: bool
Specify axis='columns' to check if row-wise values all return True.
>>> df.all(axis='columns') # doctest: +SKIP 0 True 1 False dtype: bool
Or axis=None for whether every value is True.
>>> df.all(axis=None) # doctest: +SKIP False
any(axis=None, skipna=True, split_every=False, out=None)¶
Return whether any element is True over requested axis.
Unlike DataFrame.all(), this performs an or operation. If any of the values along the specified axis is True, this will return True.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
**kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: any : Series or DataFrame (if level specified)
See also
pandas.DataFrame.all
- Return whether all elements are True.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([True, False]).any() # doctest: +SKIP True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) # doctest: +SKIP >>> df # doctest: +SKIP A B C 0 1 0 0 1 2 2 0
>>> df.any() # doctest: +SKIP A True B True C False dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 True 1 1 False 2
>>> df.any(axis='columns') # doctest: +SKIP 0 True 1 True dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 True 1 1 False 0
>>> df.any(axis='columns') # doctest: +SKIP 0 True 1 False dtype: bool
Aggregating over the entire DataFrame with axis=None.
>>> df.any(axis=None) # doctest: +SKIP True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any() # doctest: +SKIP Series([], dtype: bool)
append(other)¶
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
Parameters: other : DataFrame or Series/dict-like object, or list of these
The data to append.
ignore_index : boolean, default False
If True, do not use the index labels.
verify_integrity : boolean, default False
If True, raise ValueError on creating index with duplicates.
sort : boolean, default None
Sort columns if the columns of self and other are not aligned. The default sorting is deprecated and will change to not-sorting in a future version of pandas. Explicitly pass sort=True to silence the warning and sort. Explicitly pass sort=False to silence the warning and not sort.
New in version 0.23.0.
Returns: appended : DataFrame
See also
pandas.concat
- General function to concatenate DataFrame, Series or Panel objects
Notes
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.
Examples
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB')) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 1 2 1 3 4 >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB')) # doctest: +SKIP >>> df.append(df2) # doctest: +SKIP A B 0 1 2 1 3 4 0 5 6 1 7 8
With ignore_index set to True:
>>> df.append(df2, ignore_index=True) # doctest: +SKIP A B 0 1 2 1 3 4 2 5 6 3 7 8
The following, while not recommended methods for generating DataFrames, show two ways to generate a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A']) # doctest: +SKIP >>> for i in range(5): # doctest: +SKIP ... df = df.append({'A': i}, ignore_index=True) >>> df # doctest: +SKIP A 0 0 1 1 2 2 3 3 4 4
More efficient:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)], # doctest: +SKIP ... ignore_index=True) A 0 0 1 1 2 2 3 3 4 4
apply(func, axis=0, broadcast=None, raw=False, reduce=None, args=(), meta='__no_default__', **kwds)¶
Parallel version of pandas.DataFrame.apply
This mimics the pandas version except for the following:
- Only axis=1 is supported (and must be specified explicitly).
- The user should provide output metadata via the meta keyword.
Parameters: func : function
Function to apply to each column/row
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
- 0 or ‘index’: apply function to each column (NOT SUPPORTED)
- 1 or ‘columns’: apply function to each row
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
args : tuple
Positional arguments to pass to function in addition to the array/series
Additional keyword arguments will be passed as keywords to the function
Returns: applied : Series or DataFrame
See also
dask.DataFrame.map_partitions
Examples
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
Apply a function row-wise, passing in extra arguments in args and kwargs:
>>> def myadd(row, a, b=1): ... return row.sum() + a + b >>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5)
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.
Here we specify the output is a Series with name 'x', and dtype float64:
>>> res = ddf.apply(myadd, axis=1, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.apply(lambda row: row + 1, axis=1, meta=ddf)
applymap(func, meta='__no_default__')¶
Apply a function to a Dataframe elementwise.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
Parameters: func : callable
Python function, returns a single value from a single value.
Returns: DataFrame
Transformed DataFrame.
See also
DataFrame.apply
- Apply a function along input axis of DataFrame
Examples
>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]]) # doctest: +SKIP >>> df # doctest: +SKIP 0 1 0 1.000 2.120 1 3.356 4.567
>>> df.applymap(lambda x: len(str(x))) # doctest: +SKIP 0 1 0 3 4 1 5 5
Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.
>>> df.applymap(lambda x: x**2) # doctest: +SKIP 0 1 0 1.000000 4.494400 1 11.262736 20.857489
But it’s better to avoid applymap in that case.
>>> df ** 2 # doctest: +SKIP 0 1 0 1.000000 4.494400 1 11.262736 20.857489
assign(**kwargs)¶
Assign new columns to a DataFrame, returning a new object (a copy) with the new columns added to the original ones. Existing columns that are re-assigned will be overwritten.
Parameters: kwargs : keyword, value pairs
keywords are the column names. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
Returns: df : DataFrame
A new DataFrame with the new columns in addition to all the existing columns.
Notes
Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not specified, so you cannot refer to newly created or modified columns. All items are computed first, and then assigned in alphabetical order.
Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.
Examples
>>> df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)}) # doctest: +SKIP
Where the value is a callable, evaluated on df:
>>> df.assign(ln_A = lambda x: np.log(x.A)) # doctest: +SKIP A B ln_A 0 1 0.426905 0.000000 1 2 -0.780949 0.693147 2 3 -0.418711 1.098612 3 4 -0.269708 1.386294 4 5 -0.274002 1.609438 5 6 -0.500792 1.791759 6 7 1.649697 1.945910 7 8 -1.495604 2.079442 8 9 0.549296 2.197225 9 10 -0.758542 2.302585
Where the value already exists and is inserted:
>>> newcol = np.log(df['A']) # doctest: +SKIP >>> df.assign(ln_A=newcol) # doctest: +SKIP A B ln_A 0 1 0.426905 0.000000 1 2 -0.780949 0.693147 2 3 -0.418711 1.098612 3 4 -0.269708 1.386294 4 5 -0.274002 1.609438 5 6 -0.500792 1.791759 6 7 1.649697 1.945910 7 8 -1.495604 2.079442 8 9 0.549296 2.197225 9 10 -0.758542 2.302585
Where the keyword arguments depend on each other
>>> df = pd.DataFrame({'A': [1, 2, 3]}) # doctest: +SKIP
>>> df.assign(B=df.A, C=lambda x:x['A']+ x['B']) # doctest: +SKIP A B C 0 1 1 2 1 2 2 4 2 3 3 6
astype(dtype)¶
Cast a pandas object to a specified dtype dtype.
Parameters: dtype : data type, or dict of column name -> data type
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
copy : bool, default True.
Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
errors : {‘raise’, ‘ignore’}, default ‘raise’.
Control raising of exceptions on invalid data for provided dtype.
- raise : allow exceptions to be raised
- ignore : suppress exceptions. On error return original object
New in version 0.20.0.
raise_on_error : raise on invalid input
Deprecated since version 0.20.0: Use errors instead.
kwargs : keyword arguments to pass on to the constructor
Returns: casted : type of caller
See also
pandas.to_datetime
- Convert argument to datetime.
pandas.to_timedelta
- Convert argument to timedelta.
pandas.to_numeric
- Convert argument to a numeric type.
numpy.ndarray.astype
- Cast a numpy array to a specified type.
Examples
>>> ser = pd.Series([1, 2], dtype='int32') # doctest: +SKIP >>> ser # doctest: +SKIP 0 1 1 2 dtype: int32 >>> ser.astype('int64') # doctest: +SKIP 0 1 1 2 dtype: int64
Convert to categorical type:
>>> ser.astype('category') # doctest: +SKIP 0 1 1 2 dtype: category Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> ser.astype('category', ordered=True, categories=[2, 1]) # doctest: +SKIP 0 1 1 2 dtype: category Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1,2]) # doctest: +SKIP >>> s2 = s1.astype('int64', copy=False) # doctest: +SKIP >>> s2[0] = 10 # doctest: +SKIP >>> s1 # note that s1[0] has changed too # doctest: +SKIP 0 10 1 2 dtype: int64
bfill(axis=None, limit=None)¶
Synonym for DataFrame.fillna(method='bfill')
categorize(columns=None, index=None, split_every=None, **kwargs)¶
Convert columns of the DataFrame to category dtype.
Parameters: columns : list, optional
A list of column names to convert to categoricals. By default any column with an object dtype is converted to a categorical, and any unknown categoricals are made known.
index : bool, optional
Whether to categorize the index. By default, object indices are converted to categorical, and unknown categorical indices are made known. Set True to always categorize the index, False to never.
split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 16.
kwargs
Keyword arguments are passed on to compute.
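A minimal sketch on an object column (the column name key is illustrative); note that categorize eagerly scans the data to discover the categories:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'key': ['a', 'b', 'a', 'c']}), npartitions=2)
>>> ddf2 = ddf.categorize(columns=['key'])
>>> ddf2.key.dtype  # doctest: +SKIP
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)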
clear_divisions()¶
Forget division information
clip(lower=None, upper=None, out=None)¶
Trim values at input threshold(s).
Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
Parameters: lower : float or array_like, default None
Minimum threshold value. All values below this threshold will be set to it.
upper : float or array_like, default None
Maximum threshold value. All values above this threshold will be set to it.
axis : int or string axis name, optional
Align object with lower and upper along the given axis.
inplace : boolean, default False
Whether to perform the operation in place on the data.
New in version 0.21.0.
*args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.
Returns: Series or DataFrame
Same type as calling object with the values outside the clip boundaries replaced
See also
clip_lower
- Clip values below specified threshold(s).
clip_upper
- Clip values above specified threshold(s).
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]} # doctest: +SKIP >>> df = pd.DataFrame(data) # doctest: +SKIP >>> df # doctest: +SKIP col_0 col_1 0 9 -2 1 -3 -7 2 0 6 3 -1 8 4 5 -5
Clips per column using lower and upper thresholds:
>>> df.clip(-4, 6) # doctest: +SKIP col_0 col_1 0 6 -2 1 -3 -4 2 0 6 3 -1 6 4 5 -4
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3]) # doctest: +SKIP >>> t # doctest: +SKIP 0 2 1 -4 2 -1 3 6 4 3 dtype: int64
>>> df.clip(t, t + 4, axis=0) # doctest: +SKIP col_0 col_1 0 6 2 1 -3 -4 2 0 3 3 6 8 4 5 3
clip_lower(threshold)¶
Return copy of the input with values below a threshold truncated.
Parameters: threshold : numeric or array-like
Minimum value allowed. All values below threshold will be set to this value.
- float : every value is compared to threshold.
- array-like : The shape of threshold should match the object it’s compared to. When self is a Series, threshold should be the same length. When self is a DataFrame, threshold should be 2-D and the same shape as self for axis=None, or 1-D and the same length as the axis being compared.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Align self with threshold along the given axis.
inplace : boolean, default False
Whether to perform the operation in place on the data.
New in version 0.21.0.
Returns: clipped : same type as input
See also
Series.clip
- Return copy of input with values below and above thresholds truncated.
Series.clip_upper
- Return copy of input with values above threshold truncated.
Examples
Series single threshold clipping:
>>> s = pd.Series([5, 6, 7, 8, 9]) # doctest: +SKIP >>> s.clip_lower(8) # doctest: +SKIP 0 8 1 8 2 8 3 8 4 9 dtype: int64
Series clipping element-wise using an array of thresholds. threshold should be the same length as the Series.
>>> elemwise_thresholds = [4, 8, 7, 2, 5] # doctest: +SKIP >>> s.clip_lower(elemwise_thresholds) # doctest: +SKIP 0 5 1 8 2 7 3 8 4 9 dtype: int64
DataFrames can be compared to a scalar.
>>> df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 1 2 1 3 4 2 5 6
>>> df.clip_lower(3) # doctest: +SKIP A B 0 3 3 1 3 4 2 5 6
Or to an array of values. By default, threshold should be the same shape as the DataFrame.
>>> df.clip_lower(np.array([[3, 4], [2, 2], [6, 2]])) # doctest: +SKIP A B 0 3 4 1 3 4 2 6 6
Control how threshold is broadcast with axis. In this case threshold should be the same length as the axis specified by axis.
>>> df.clip_lower(np.array([3, 3, 5]), axis='index') # doctest: +SKIP A B 0 3 3 1 3 4 2 5 6
>>> df.clip_lower(np.array([4, 5]), axis='columns') # doctest: +SKIP A B 0 4 5 1 4 5 2 5 6
clip_upper(threshold)¶
Return copy of input with values above given value(s) truncated.
Parameters: threshold : float or array_like
axis : int or string axis name, optional
Align object with threshold along the given axis.
inplace : boolean, default False
Whether to perform the operation in place on the data
New in version 0.21.0.
Returns: clipped : same type as input
See also
combine(other, func, fill_value=None, overwrite=True)¶
Add two DataFrame objects and do not propagate NaN values, so if for a (column, time) one frame is missing a value, it will default to the other frame’s value (which might be NaN as well)
Parameters: other : DataFrame
func : function
Function that takes two series as inputs and return a Series or a scalar
fill_value : scalar value
overwrite : boolean, default True
If True then overwrite values for common keys in the calling frame
Returns: result : DataFrame
See also
DataFrame.combine_first
- Combine two DataFrame objects and default to non-null values in frame calling the method
Examples
>>> df1 = DataFrame({'A': [0, 0], 'B': [4, 4]}) # doctest: +SKIP >>> df2 = DataFrame({'A': [1, 1], 'B': [3, 3]}) # doctest: +SKIP >>> df1.combine(df2, lambda s1, s2: s1 if s1.sum() < s2.sum() else s2) # doctest: +SKIP A B 0 0 3 1 0 3
combine_first(other)¶
Combine two DataFrame objects and default to non-null values in frame calling the method. Result index columns will be the union of the respective indexes and columns
Parameters: other : DataFrame
Returns: combined : DataFrame
See also
DataFrame.combine
- Perform series-wise operation on two DataFrames using a given function
Examples
df1’s values prioritized, use values from df2 to fill holes:
>>> df1 = pd.DataFrame([[1, np.nan]]) # doctest: +SKIP >>> df2 = pd.DataFrame([[3, 4]]) # doctest: +SKIP >>> df1.combine_first(df2) # doctest: +SKIP 0 1 0 1 4.0
compute(**kwargs)¶
Compute this dask collection
This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a numpy.array() and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.
Parameters: scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs
Extra keywords to forward to the scheduler function.
See also
dask.base.compute
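A minimal sketch: compute materializes the lazy collection as an in-memory pandas object, so the result must fit on a single machine:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)
>>> pdf = ddf.compute()
>>> type(pdf)  # doctest: +SKIP
<class 'pandas.core.frame.DataFrame'>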
copy()¶
Make a copy of the dataframe
This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data
corr(method='pearson', min_periods=None, split_every=False)¶
Compute pairwise correlation of columns, excluding NA/null values
Parameters: method : {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result. Currently only available for pearson and spearman correlation
Returns: y : DataFrame
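No example is given above; a minimal sketch with two illustrative columns (the result is itself lazy until computed):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'a': [1, 2, 3, 4],
...                                    'b': [4, 3, 2, 1]}), npartitions=2)
>>> ddf.corr().compute()  # doctest: +SKIP
     a    b
a  1.0 -1.0
b -1.0  1.0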
count(axis=None, split_every=False)¶
Count non-NA cells for each column or row.
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
level : int or str, optional
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.
numeric_only : boolean, default False
Include only float, int or boolean data.
Returns: Series or DataFrame
For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.
See also
Series.count
- number of non-NA elements in a Series
DataFrame.shape
- number of DataFrame rows and columns (including NA elements)
DataFrame.isna
- boolean same-sized DataFrame showing places of NA elements
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person": # doctest: +SKIP ... ["John", "Myla", None, "John", "Myla"], ... "Age": [24., np.nan, 21., 33, 26], ... "Single": [False, True, True, True, False]}) >>> df # doctest: +SKIP Person Age Single 0 John 24.0 False 1 Myla NaN True 2 None 21.0 True 3 John 33.0 True 4 Myla 26.0 False
Notice the uncounted NA values:
>>> df.count() # doctest: +SKIP Person 4 Age 4 Single 5 dtype: int64
Counts for each row:
>>> df.count(axis='columns') # doctest: +SKIP 0 3 1 2 2 2 3 3 4 3 dtype: int64
Counts for one level of a MultiIndex:
>>> df.set_index(["Person", "Single"]).count(level="Person") # doctest: +SKIP Age Person John 2 Myla 1
cov(min_periods=None, split_every=False)¶
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between different measures across time.
Parameters: min_periods : int, optional
Minimum number of observations required per pair of columns to have a valid result.
Returns: DataFrame
The covariance matrix of the series of the DataFrame.
See also
pandas.Series.cov
- compute covariance with another Series
pandas.core.window.EWM.cov
- exponential weighted sample covariance
pandas.core.window.Expanding.cov
- expanding sample covariance
pandas.core.window.Rolling.cov
- rolling sample covariance
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-1.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.
Examples
>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], # doctest: +SKIP ... columns=['dogs', 'cats']) >>> df.cov() # doctest: +SKIP dogs cats dogs 0.666667 -1.000000 cats -1.000000 1.666667
>>> np.random.seed(42) # doctest: +SKIP >>> df = pd.DataFrame(np.random.randn(1000, 5), # doctest: +SKIP ... columns=['a', 'b', 'c', 'd', 'e']) >>> df.cov() # doctest: +SKIP a b c d e a 0.998438 -0.020161 0.059277 -0.008943 0.014144 b -0.020161 1.059352 -0.008543 -0.024738 0.009826 c 0.059277 -0.008543 1.010670 -0.001486 -0.000271 d -0.008943 -0.024738 -0.001486 0.921297 -0.013692 e 0.014144 0.009826 -0.000271 -0.013692 0.977795
Minimum number of periods
This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:
>>> np.random.seed(42) # doctest: +SKIP >>> df = pd.DataFrame(np.random.randn(20, 3), # doctest: +SKIP ... columns=['a', 'b', 'c']) >>> df.loc[df.index[:5], 'a'] = np.nan # doctest: +SKIP >>> df.loc[df.index[5:10], 'b'] = np.nan # doctest: +SKIP >>> df.cov(min_periods=12) # doctest: +SKIP a b c a 0.316741 NaN -0.150812 b NaN 1.248003 0.191417 c -0.150812 0.191417 0.895202
cummax(axis=None, skipna=True, out=None)¶
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cummax : Series or DataFrame
See also
pandas.core.window.Expanding.max
- Similar functionality but ignores NaN values.
DataFrame.max
- Return the maximum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use skipna=False
>>> s.cummax(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummax() # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1
>>> df.cummax(axis=1) # doctest: +SKIP A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
cummin(axis=None, skipna=True, out=None)¶
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cummin : Series or DataFrame
See also
pandas.core.window.Expanding.min
- Similar functionality but ignores NaN values.
DataFrame.min
- Return the minimum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() # doctest: +SKIP 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cummin(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummin()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0
To iterate over columns and find the minimum in each row, use
axis=1
>>> df.cummin(axis=1) # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
-
cumprod
(axis=None, skipna=True, dtype=None, out=None)¶ Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cumprod : Series or DataFrame
See also
pandas.core.window.Expanding.prod
- Similar functionality but ignores NaN values.
DataFrame.prod
- Return the product over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() # doctest: +SKIP 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cumprod(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumprod()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0
To iterate over columns and find the product in each row, use
axis=1
>>> df.cumprod(axis=1) # doctest: +SKIP A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
-
cumsum
(axis=None, skipna=True, dtype=None, out=None)¶ Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cumsum : Series or DataFrame
See also
pandas.core.window.Expanding.sum
- Similar functionality but ignores NaN values.
DataFrame.sum
- Return the sum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() # doctest: +SKIP 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cumsum(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumsum()  # doctest: +SKIP
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0
To iterate over columns and find the sum in each row, use
axis=1
>>> df.cumsum(axis=1) # doctest: +SKIP A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
-
describe
(split_every=False, percentiles=None)¶ Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding
NaN
values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters: percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include : 'all', list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored for Series. Here are the options:
- 'all' : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
- None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional
A black list of data types to omit from the result. Ignored for Series. Here are the options:
- A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
- None (default) : The result will exclude nothing.
Returns: summary: Series/DataFrame of summary statistics
See also
DataFrame.count
,DataFrame.max
,DataFrame.min
,DataFrame.mean
,DataFrame.std
,DataFrame.select_dtypes
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
Describing a numeric
Series
.>>> s = pd.Series([1, 2, 3]) # doctest: +SKIP >>> s.describe() # doctest: +SKIP count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Describing a categorical
Series
.>>> s = pd.Series(['a', 'a', 'b', 'c']) # doctest: +SKIP >>> s.describe() # doctest: +SKIP count 4 unique 3 top a freq 2 dtype: object
Describing a timestamp
Series
.>>> s = pd.Series([ # doctest: +SKIP ... np.datetime64("2000-01-01"), ... np.datetime64("2010-01-01"), ... np.datetime64("2010-01-01") ... ]) >>> s.describe() # doctest: +SKIP count 3 unique 2 top 2010-01-01 00:00:00 freq 2 first 2000-01-01 00:00:00 last 2010-01-01 00:00:00 dtype: object
Describing a
DataFrame
. By default only numeric fields are returned.>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'], # doctest: +SKIP ... 'numeric': [1, 2, 3], ... 'categorical': pd.Categorical(['d','e','f']) ... }) >>> df.describe() # doctest: +SKIP numeric count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Describing all columns of a
DataFrame
regardless of data type.>>> df.describe(include='all') # doctest: +SKIP categorical numeric object count 3 3.0 3 unique 3 NaN 3 top f NaN c freq 1 NaN 1 mean NaN 2.0 NaN std NaN 1.0 NaN min NaN 1.0 NaN 25% NaN 1.5 NaN 50% NaN 2.0 NaN 75% NaN 2.5 NaN max NaN 3.0 NaN
Describing a column from a
DataFrame
by accessing it as an attribute.>>> df.numeric.describe() # doctest: +SKIP count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0 Name: numeric, dtype: float64
Including only numeric columns in a
DataFrame
description.>>> df.describe(include=[np.number]) # doctest: +SKIP numeric count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Including only string columns in a
DataFrame
description.>>> df.describe(include=[np.object]) # doctest: +SKIP object count 3 unique 3 top c freq 1
Including only categorical columns from a
DataFrame
description.>>> df.describe(include=['category']) # doctest: +SKIP categorical count 3 unique 3 top f freq 1
Excluding numeric columns from a
DataFrame
description.>>> df.describe(exclude=[np.number]) # doctest: +SKIP categorical object count 3 3 unique 3 3 top f c freq 1 1
Excluding object columns from a
DataFrame
description.>>> df.describe(exclude=[np.object]) # doctest: +SKIP categorical numeric count 3 3.0 unique 3 NaN top f NaN freq 1 NaN mean NaN 2.0 std NaN 1.0 min NaN 1.0 25% NaN 1.5 50% NaN 2.0 75% NaN 2.5 max NaN 3.0
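On a Dask DataFrame the call is lazy until compute() is invoked, and on partitioned data the percentile rows may be approximated rather than computed exactly. A small, hedged sketch with an illustrative numeric-only frame (pdf and ddf are made-up names):
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'numeric': [1, 2, 3, 4]})  # illustrative numeric-only frame
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.describe().compute()  # index: count, mean, std, min, 25%, 50%, 75%, max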
-
diff
(periods=1, axis=0)¶ First discrete difference of element.
Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the same column of the previous row).
Parameters: periods : int, default 1
Periods to shift for calculating difference, accepts negative values.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Take difference over rows (0) or columns (1).
New in version 0.16.1.
Returns: diffed : DataFrame
See also
Series.diff
- First discrete difference for a Series.
DataFrame.pct_change
- Percent change over given number of periods.
DataFrame.shift
- Shift index by desired number of periods with an optional time freq.
Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], # doctest: +SKIP ... 'b': [1, 1, 2, 3, 5, 8], ... 'c': [1, 4, 9, 16, 25, 36]}) >>> df # doctest: +SKIP a b c 0 1 1 1 1 2 1 4 2 3 2 9 3 4 3 16 4 5 5 25 5 6 8 36
>>> df.diff() # doctest: +SKIP a b c 0 NaN NaN NaN 1 1.0 0.0 3.0 2 1.0 1.0 5.0 3 1.0 1.0 7.0 4 1.0 2.0 9.0 5 1.0 3.0 11.0
Difference with previous column
>>> df.diff(axis=1) # doctest: +SKIP a b c 0 NaN 0.0 0.0 1 NaN -1.0 3.0 2 NaN -1.0 7.0 3 NaN -1.0 13.0 4 NaN 0.0 20.0 5 NaN 2.0 28.0
Difference with 3rd previous row
>>> df.diff(periods=3) # doctest: +SKIP a b c 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 3.0 2.0 15.0 4 3.0 4.0 21.0 5 3.0 6.0 27.0
Difference with following row
>>> df.diff(periods=-1) # doctest: +SKIP a b c 0 -1.0 0.0 -3.0 1 -1.0 -1.0 -5.0 2 -1.0 -1.0 -7.0 3 -1.0 -2.0 -9.0 4 -1.0 -3.0 -11.0 5 NaN NaN NaN
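The same calls work on a Dask DataFrame, where rows are shared between neighbouring partitions (see the map_overlap machinery described later on this page) so the results should match the pandas outputs above. A minimal sketch, assuming the frame df from the example:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.diff().compute()            # same values as df.diff() above
>>> ddf.diff(periods=-1).compute()  # difference with the following row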
-
div
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to
dataframe / other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
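A minimal, hedged sketch of how fill_value behaves with div (the frames a and b below are made up for illustration): a missing value in one input is replaced by fill_value before dividing, while positions missing in both inputs remain missing.
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.DataFrame({'x': [1.0, 2.0, np.nan]}), npartitions=1)
>>> b = dd.from_pandas(pd.DataFrame({'x': [2.0, np.nan, 4.0]}), npartitions=1)
>>> a.div(b, fill_value=1.0).compute()  # NaN on either side is treated as 1.0
      x
0  0.50
1  2.00
2  0.25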
-
drop
(labels, axis=0, errors='raise')¶ Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level.
Parameters: labels : single label or list-like
Index or column labels to drop.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index, columns : single label or list-like
Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
New in version 0.21.0.
level : int or level name, optional
For MultiIndex, level from which the labels will be removed.
inplace : bool, default False
If True, do operation inplace and return None.
errors : {‘ignore’, ‘raise’}, default ‘raise’
If ‘ignore’, suppress error and only existing labels are dropped.
Returns: dropped : pandas.DataFrame
Raises: KeyError
If none of the labels are found in the selected axis
See also
DataFrame.loc
- Label-location based indexer for selection by label.
DataFrame.dropna
- Return DataFrame with labels on given axis omitted where (all or any) data are missing
DataFrame.drop_duplicates
- Return DataFrame with duplicate rows removed, optionally only considering certain columns
Series.drop
- Return Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3,4), # doctest: +SKIP ... columns=['A', 'B', 'C', 'D']) >>> df # doctest: +SKIP A B C D 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11
Drop columns
>>> df.drop(['B', 'C'], axis=1) # doctest: +SKIP A D 0 0 3 1 4 7 2 8 11
>>> df.drop(columns=['B', 'C']) # doctest: +SKIP A D 0 0 3 1 4 7 2 8 11
Drop a row by index
>>> df.drop([0, 1]) # doctest: +SKIP A B C D 2 8 9 10 11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'], # doctest: +SKIP ... ['speed', 'weight', 'length']], ... labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], ... [0, 1, 2, 0, 1, 2, 0, 1, 2]]) >>> df = pd.DataFrame(index=midx, columns=['big', 'small'], # doctest: +SKIP ... data=[[45, 30], [200, 100], [1.5, 1], [30, 20], ... [250, 150], [1.5, 0.8], [320, 250], ... [1, 0.8], [0.3,0.2]]) >>> df # doctest: +SKIP big small lama speed 45.0 30.0 weight 200.0 100.0 length 1.5 1.0 cow speed 30.0 20.0 weight 250.0 150.0 length 1.5 0.8 falcon speed 320.0 250.0 weight 1.0 0.8 length 0.3 0.2
>>> df.drop(index='cow', columns='small') # doctest: +SKIP big lama speed 45.0 weight 200.0 length 1.5 falcon speed 320.0 weight 1.0 length 0.3
>>> df.drop(index='length', level=1) # doctest: +SKIP big small lama speed 45.0 30.0 weight 200.0 100.0 cow speed 30.0 20.0 weight 250.0 150.0 falcon speed 320.0 250.0 weight 1.0 0.8
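With Dask, dropping columns works the same way and stays lazy; dropping rows by index label may be more restricted on a partitioned frame, so this hedged sketch (with an illustrative frame) sticks to the column case:
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.drop(['B', 'C'], axis=1).compute()
   A   D
0  0   3
1  4   7
2  8  11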
-
drop_duplicates
(split_every=None, split_out=1, **kwargs)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns
keep : {‘first’, ‘last’, False}, default ‘first’
first
: Drop duplicates except for the first occurrence.last
: Drop duplicates except for the last occurrence.- False : Drop all duplicates.
inplace : boolean, default False
Whether to drop duplicates in place or to return a copy
Returns: deduplicated : DataFrame
-
dropna
(how='any', subset=None)¶ Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Determine if rows or columns which contain missing values are removed.
- 0, or ‘index’ : Drop rows which contain missing values.
- 1, or ‘columns’ : Drop columns which contain missing value.
Deprecated since version 0.23.0: Pass tuple or list to drop on multiple axes.
how : {‘any’, ‘all’}, default ‘any’
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
- ‘any’ : If any NA values are present, drop that row or column.
- ‘all’ : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
Returns: DataFrame
DataFrame with NA entries dropped from it.
See also
DataFrame.isna
- Indicate missing values.
DataFrame.notna
- Indicate existing (non-missing) values.
DataFrame.fillna
- Replace missing values.
Series.dropna
- Drop missing values.
Index.dropna
- Drop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'], # doctest: +SKIP ... "toy": [np.nan, 'Batmobile', 'Bullwhip'], ... "born": [pd.NaT, pd.Timestamp("1940-04-25"), ... pd.NaT]}) >>> df # doctest: +SKIP name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna() # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns') # doctest: +SKIP name 0 Alfred 1 Batman 2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all') # doctest: +SKIP name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2) # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'born']) # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True) # doctest: +SKIP >>> df # doctest: +SKIP name toy born 1 Batman Batmobile 1940-04-25
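The Dask signature shown above accepts how and subset. A minimal sketch reusing the name/toy/born frame from before the inplace example; each call should return the same rows as the corresponding pandas output:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(df, npartitions=1)
>>> ddf.dropna().compute()                         # drop rows with any missing value
>>> ddf.dropna(how='all').compute()                # drop rows that are entirely missing
>>> ddf.dropna(subset=['name', 'born']).compute()  # only look at these columns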
-
dtypes
¶ Return data types
-
eq
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods eq
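The flexible comparison wrappers (eq, ne, lt, le, gt, ge) compare element-wise against a scalar, Series, or DataFrame and return a boolean result. A minimal, hedged sketch with an illustrative one-column frame:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)
>>> ddf.eq(2).compute()  # True where the element equals 2
       x
0  False
1   True
2  False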
-
eval
(expr, inplace=None, **kwargs)¶ Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
Parameters: expr : str
The expression string to evaluate.
inplace : bool, default False
If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.
New in version 0.18.0.
kwargs : dict
Returns: ndarray, scalar, or pandas object
The result of the evaluation.
See also
DataFrame.query
- Evaluates a boolean expression to query the columns of a frame.
DataFrame.assign
- Can evaluate an expression or function to create new values for a column.
pandas.eval
- Evaluate a Python expression as a string using various backends.
Notes
For more details see the API documentation for
eval()
. For detailed examples see enhancing performance with eval.Examples
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2 >>> df.eval('A + B') # doctest: +SKIP 0 11 1 10 2 9 3 8 4 7 dtype: int64
Assignment is allowed though by default the original DataFrame is not modified.
>>> df.eval('C = A + B') # doctest: +SKIP A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7 >>> df # doctest: +SKIP A B 0 1 10 1 2 8 2 3 6 3 4 4 4 5 2
Use
inplace=True
to modify the original DataFrame.>>> df.eval('C = A + B', inplace=True) # doctest: +SKIP >>> df # doctest: +SKIP A B C 0 1 10 11 1 2 8 10 2 3 6 9 3 4 4 8 4 5 2 7
-
ffill
(axis=None, limit=None)¶ Synonym for
DataFrame.fillna(method='ffill')
-
fillna
(value=None, method=None, limit=None, axis=None)¶ Fill NA/NaN values using the specified method
Parameters: value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
axis : {0 or ‘index’, 1 or ‘columns’}
inplace : boolean, default False
If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast : dict, default is None
a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
Returns: filled : DataFrame
See also
interpolate
- Fill NaN values using interpolation.
reindex
,asfreq
Examples
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], # doctest: +SKIP ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, 5], ... [np.nan, 3, np.nan, 4]], ... columns=list('ABCD')) >>> df # doctest: +SKIP A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4
Replace all NaN elements with 0s.
>>> df.fillna(0) # doctest: +SKIP A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill') # doctest: +SKIP A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3} # doctest: +SKIP >>> df.fillna(value=values) # doctest: +SKIP A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 2.0 1 2 0.0 1.0 2.0 5 3 0.0 3.0 2.0 4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1) # doctest: +SKIP A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 NaN 1 2 NaN 1.0 NaN 5 3 NaN 3.0 NaN 4
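The same fills work lazily on a Dask DataFrame; forward-filling is designed to carry values across partition boundaries, so the results should match the pandas outputs above. A hedged sketch assuming the ABCD frame from the example:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf.fillna(0).compute()               # replace all NaN elements with 0
>>> ddf.fillna(method='ffill').compute()  # propagate the last valid value forward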
-
first
(offset)¶ Convenience method for subsetting initial periods of time series data based on a date offset.
Parameters: offset : string, DateOffset, dateutil.relativedelta
Returns: subset : type of caller
Raises: TypeError
If the index is not a
DatetimeIndex
See also
last
- Select final periods of time series based on a date offset
at_time
- Select values at a particular time of the day
between_time
- Select values between particular times of the day
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D') # doctest: +SKIP >>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i) # doctest: +SKIP >>> ts # doctest: +SKIP A 2018-04-09 1 2018-04-11 2 2018-04-13 3 2018-04-15 4
Get the rows for the first 3 days:
>>> ts.first('3D') # doctest: +SKIP A 2018-04-09 1 2018-04-11 2
Notice that the data for the first 3 calendar days was returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
-
floordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to
dataframe // other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
-
ge
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ge
-
get_dtype_counts
()¶ Return counts of unique dtypes in this object.
Returns: dtype : Series
Series with the count of columns with each dtype.
See also
dtypes
- Return the dtypes in this object.
Examples
>>> a = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]] # doctest: +SKIP >>> df = pd.DataFrame(a, columns=['str', 'int', 'float']) # doctest: +SKIP >>> df # doctest: +SKIP str int float 0 a 1 1.0 1 b 2 2.0 2 c 3 3.0
>>> df.get_dtype_counts() # doctest: +SKIP float64 1 int64 1 object 1 dtype: int64
-
get_ftype_counts
()¶ Return counts of unique ftypes in this object.
Deprecated since version 0.23.0.
This is useful for SparseDataFrame or for DataFrames containing sparse arrays.
Returns: dtype : Series
Series with the count of columns with each type and sparsity (dense/sparse)
See also
ftypes
- Return ftypes (indication of sparse/dense and dtype) in this object.
Examples
>>> a = [['a', 1, 1.0], ['b', 2, 2.0], ['c', 3, 3.0]] # doctest: +SKIP >>> df = pd.DataFrame(a, columns=['str', 'int', 'float']) # doctest: +SKIP >>> df # doctest: +SKIP str int float 0 a 1 1.0 1 b 2 2.0 2 c 3 3.0
>>> df.get_ftype_counts() # doctest: +SKIP float64:dense 1 int64:dense 1 object:dense 1 dtype: int64
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
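The returned object is itself a lazy, single-partition Dask DataFrame. A minimal sketch with an illustrative six-row frame split into three partitions of two rows each:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=3)
>>> part = ddf.get_partition(1)  # still lazy
>>> part.npartitions
1
>>> part.compute()  # rows 2 and 3 of the original frame
   x
2  2
3  3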
-
groupby
(by=None, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters: by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see the .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible, otherwise return a consistent type
observed : boolean, default False
This only applies if any of the groupers are Categoricals If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
New in version 0.23.0.
Returns: GroupBy object
See also
resample
- Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
DataFrame results
>>> data.groupby(func, axis=0).mean() # doctest: +SKIP >>> data.groupby(['col1', 'col2'])['col3'].mean() # doctest: +SKIP
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean() # doctest: +SKIP
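A hedged end-to-end sketch on a Dask DataFrame (the column names are illustrative); the grouped aggregation stays lazy until compute() is called:
>>> import dask.dataframe as dd
>>> pdf = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col3': [1, 2, 3]})
>>> ddf = dd.from_pandas(pdf, npartitions=2)
>>> ddf.groupby('col1')['col3'].mean().compute()
This evaluates to the per-group means ('a' is 1.5 and 'b' is 3.0).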
-
gt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods gt
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: n : int, optional
The number of rows to return. Default is 5.
npartitions : int, optional
Elements are only taken from the first
npartitions
, with a default of 1. If there are fewer thann
rows in the firstnpartitions
a warning will be raised and any found rows returned. Pass -1 to use all partitions.compute : bool, optional
Whether to compute the result, default is True.
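A hedged sketch of the npartitions keyword with an illustrative ten-row frame split into five partitions of two rows each:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=5)
>>> ddf.head(3)                  # only looks at the first partition (2 rows): warns, returns 2 rows
>>> ddf.head(3, npartitions=2)   # take rows from the first two partitions
>>> ddf.head(3, npartitions=-1)  # use all partitions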
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: idxmax : Series
Raises: ValueError
- If the row/column is empty
See also
Notes
This method is the DataFrame version of
ndarray.argmax
.
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: idxmin : Series
Raises: ValueError
- If the row/column is empty
See also
Notes
This method is the DataFrame version of
ndarray.argmin
.
-
iloc
¶ Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
See Indexing into Dask DataFrames for more.
Examples
>>> df.iloc[:, [2, 0, 1]] # doctest: +SKIP
-
index
¶ Return dask Index instance
-
info
(buf=None, verbose=False, memory_usage=False)¶ Concise summary of a Dask DataFrame.
-
isin
(values)¶ Return boolean DataFrame showing whether each element in the DataFrame is contained in values.
Parameters: values : iterable, Series, DataFrame or dictionary
The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dictionary, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
Returns: DataFrame of booleans
Examples
When
values
is a list:>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']}) # doctest: +SKIP >>> df.isin([1, 3, 12, 'a']) # doctest: +SKIP A B 0 True True 1 False False 2 True False
When
values
is a dict:>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]}) # doctest: +SKIP >>> df.isin({'A': [1, 3], 'B': [4, 7, 12]}) # doctest: +SKIP A B 0 True False # Note that B didn't match the 1 here. 1 False True 2 True True
When
values
is a Series or DataFrame:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})  # doctest: +SKIP
>>> other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})  # doctest: +SKIP
>>> df.isin(other)  # doctest: +SKIP
       A      B
0   True  False
1  False  False  # Column A in `other` has a 3, but not at index 1.
2   True   True
-
isna
()¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN
, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings''
ornumpy.inf
are not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True
).Returns: DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
See also
DataFrame.isnull
- alias of isna
DataFrame.notna
- boolean inverse of isna
DataFrame.dropna
- omit axes labels with missing values
isna
- top-level isna
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame({'age': [5, 6, np.NaN], # doctest: +SKIP ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df # doctest: +SKIP age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() # doctest: +SKIP age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) # doctest: +SKIP >>> ser # doctest: +SKIP 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() # doctest: +SKIP 0 False 1 False 2 True dtype: bool
-
isnull
()¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or
numpy.NaN
, gets mapped to True values. Everything else gets mapped to False values. Characters such as empty strings''
ornumpy.inf
are not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True
).Returns: DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
See also
DataFrame.isnull
- alias of isna
DataFrame.notna
- boolean inverse of isna
DataFrame.dropna
- omit axes labels with missing values
isna
- top-level isna
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame({'age': [5, 6, np.NaN], # doctest: +SKIP ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df # doctest: +SKIP age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.isna() # doctest: +SKIP age born name toy 0 False True False True 1 False False False False 2 True False False False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN]) # doctest: +SKIP >>> ser # doctest: +SKIP 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.isna() # doctest: +SKIP 0 False 1 False 2 True dtype: bool
-
iterrows
()¶ Iterate over DataFrame rows as (index, Series) pairs.
Returns: it : generator
A generator that iterates over the rows of the frame.
See also
itertuples
- Iterate over DataFrame rows as namedtuples of the values.
iteritems
- Iterate over (column name, Series) pairs.
Notes
Because
iterrows
returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float']) # doctest: +SKIP >>> row = next(df.iterrows())[1] # doctest: +SKIP >>> row # doctest: +SKIP int 1.0 float 1.5 Name: 0, dtype: float64 >>> print(row['int'].dtype) # doctest: +SKIP float64 >>> print(df['int'].dtype) # doctest: +SKIP int64
To preserve dtypes while iterating over the rows, it is better to use
itertuples()
which returns namedtuples of the values and which is generally faster thaniterrows
.You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
-
itertuples
()¶ Iterate over DataFrame rows as namedtuples, with index value as first element of the tuple.
Parameters: index : boolean, default True
If True, return the index as the first element of the tuple.
name : string, default “Pandas”
The name of the returned namedtuples or None to return regular tuples.
See also
iterrows
- Iterate over DataFrame rows as (index, Series) pairs.
iteritems
- Iterate over (column name, Series) pairs.
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
Examples
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, # doctest: +SKIP index=['a', 'b']) >>> df # doctest: +SKIP col1 col2 a 1 0.1 b 2 0.2 >>> for row in df.itertuples(): # doctest: +SKIP ... print(row) ... Pandas(Index='a', col1=1, col2=0.10000000000000001) Pandas(Index='b', col1=2, col2=0.20000000000000001)
-
join
(other, on=None, how='left', lsuffix='', rsuffix='', npartitions=None, shuffle=None)¶ Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.
Parameters: other : DataFrame, Series with name field set, or list of DataFrame
Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame
on : name, tuple/list of names, or array-like
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’
How to handle the operation of the two objects.
- left: use calling frame’s index (or column if on is specified)
- right: use other frame’s index
- outer: form union of calling frame’s index (or column if on is specified) with other frame’s index, and sort it lexicographically
- inner: form intersection of calling frame’s index (or column if on is specified) with other frame’s index, preserving the order of the calling’s one
lsuffix : string
Suffix to use from left frame’s overlapping columns
rsuffix : string
Suffix to use from right frame’s overlapping columns
sort : boolean, default False
Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword)
Returns: joined : DataFrame
See also
DataFrame.merge
- For column(s)-on-columns(s) operations
Notes
on, lsuffix, and rsuffix options are not supported when passing a list of DataFrame objects
Support for specifying index levels as the on parameter was added in version 0.23.0
Examples
>>> caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], # doctest: +SKIP ... 'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> caller # doctest: +SKIP A key 0 A0 K0 1 A1 K1 2 A2 K2 3 A3 K3 4 A4 K4 5 A5 K5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'], # doctest: +SKIP ... 'B': ['B0', 'B1', 'B2']})
>>> other # doctest: +SKIP B key 0 B0 K0 1 B1 K1 2 B2 K2
Join DataFrames using their indexes.
>>> caller.join(other, lsuffix='_caller', rsuffix='_other') # doctest: +SKIP
>>> A key_caller B key_other # doctest: +SKIP 0 A0 K0 B0 K0 1 A1 K1 B1 K1 2 A2 K2 B2 K2 3 A3 K3 NaN NaN 4 A4 K4 NaN NaN 5 A5 K5 NaN NaN
If we want to join using the key columns, we need to set key to be the index in both caller and other. The joined DataFrame will have key as its index.
>>> caller.set_index('key').join(other.set_index('key')) # doctest: +SKIP
>>> A B # doctest: +SKIP key K0 A0 B0 K1 A1 B1 K2 A2 B2 K3 A3 NaN K4 A4 NaN K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in the caller. This method preserves the original caller’s index in the result.
>>> caller.join(other.set_index('key'), on='key') # doctest: +SKIP
>>> A key B # doctest: +SKIP 0 A0 K0 B0 1 A1 K1 B1 2 A2 K2 B2 3 A3 K3 NaN 4 A4 K4 NaN 5 A5 K5 NaN
-
known_divisions
¶ Whether divisions are already known
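A minimal sketch: from_pandas on a sorted index produces known divisions, while clear_divisions() (used in the map_partitions notes later on this page) discards them; the small frame below is illustrative:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(4)}), npartitions=2)
>>> ddf.known_divisions
True
>>> ddf.divisions  # index values at the partition boundaries
(0, 2, 3)
>>> ddf.clear_divisions().known_divisions
False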
-
last
(offset)¶ Convenience method for subsetting final periods of time series data based on a date offset.
Parameters: offset : string, DateOffset, dateutil.relativedelta
Returns: subset : type of caller
Raises: TypeError
If the index is not a
DatetimeIndex
See also
first
- Select initial periods of time series based on a date offset
at_time
- Select values at a particular time of the day
between_time
- Select values between particular times of the day
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D') # doctest: +SKIP >>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i) # doctest: +SKIP >>> ts # doctest: +SKIP A 2018-04-09 1 2018-04-11 2 2018-04-13 3 2018-04-15 4
Get the rows for the last 3 days:
>>> ts.last('3D') # doctest: +SKIP A 2018-04-13 3 2018-04-15 4
Notice that the data for the last 3 calendar days was returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
-
le
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods le
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"] # doctest: +SKIP >>> df.loc["b":"d"] # doctest: +SKIP
-
lt
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods lt
-
map_overlap
(func, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
This can be useful for implementing windowing functions such as
df.rolling(...).mean()
ordf.diff()
.Parameters: func : function
Function applied to each partition.
before : int
The number of rows to prepend to partition
i
from the end of partitioni - 1
.after : int
The number of rows to append to partition
i
from the beginning of partitioni + 1
.args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.Notes
Given positive integers
before
andafter
, and a functionfunc
,map_overlap
does the following:- Prepend
before
rows to each partitioni
from the end of partitioni - 1
. The first partition has no rows prepended. - Append
after
rows to each partitioni
from the beginning of partitioni + 1
. The last partition has no rows appended. - Apply
func
to each partition, passing in any extraargs
andkwargs
if provided. - Trim
before
rows from the beginning of all but the first partition. - Trim
after
rows from the end of all but the last partition.
Note that the index and divisions are assumed to remain unchanged.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to
df.rolling(2).sum()
:>>> ddf.compute() x y 0 1 1.0 1 2 2.0 2 4 3.0 3 7 4.0 4 11 5.0 >>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute() x y 0 NaN NaN 1 3.0 3.0 2 6.0 5.0 3 11.0 7.0 4 18.0 9.0
The pandas
diff
method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls todf.diff
to each partition after prepending/appending that many rows, depending on sign:>>> def diff(df, periods=1): ... before, after = (periods, 0) if periods > 0 else (0, -periods) ... return df.map_overlap(lambda df, periods=1: df.diff(periods), ... periods, 0, periods=periods) >>> diff(ddf, 1).compute() x y 0 NaN NaN 1 1.0 1.0 2 2.0 1.0 3 3.0 1.0 4 4.0 1.0
If you have a
DatetimeIndex
, you can use apd.Timedelta
for time- based windows.>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10)) >>> dts = dd.from_pandas(ts, npartitions=2) >>> dts.map_overlap(lambda df: df.rolling('2D').sum(), ... pd.Timedelta('2D'), 0).compute() 2017-01-01 0.0 2017-01-02 1.0 2017-01-03 3.0 2017-01-04 5.0 2017-01-05 7.0 2017-01-06 9.0 2017-01-07 11.0 2017-01-08 13.0 2017-01-09 15.0 2017-01-10 17.0 dtype: float64
- Prepend
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Note that the index and divisions are assumed to remain unchanged.
Parameters: func : function
Function applied to each partition.
args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after. Arguments and keywords may contain
Scalar
,Delayed
or regular python objects.meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5], ... 'y': [1., 2., 3., 4., 5.]}) >>> ddf = dd.from_pandas(df, npartitions=2)
One can use
map_partitions
to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1): ... return df.x + df.y + a + b >>> res = ddf.map_partitions(myadd, 1, b=2) >>> res.dtype dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the
meta
keyword. This can be specified in many forms, for more information seedask.dataframe.utils.make_meta
.Here we specify the output is a Series with no name, and dtype
float64
:>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y)) >>> res.dtypes x int64 y float64 z float64 dtype: object
As before, the output metadata can also be specified manually. This time we pass in a
dict
, as the output is a DataFrame:>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y), ... meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:
>>> ddf.map_partitions(func).clear_divisions() # doctest: +SKIP
-
mask
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Parameters: cond : boolean NDFrame, array-like, or callable
Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as cond.
other : scalar, NDFrame, or callable
Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
errors : str, {‘raise’, ‘ignore’}, default ‘raise’
raise
: allow exceptions to be raisedignore
: suppress exceptions. On error return original object
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Deprecated since version 0.21.0.
Returns: wh : same type as caller
See also
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond
isFalse
the element is used; otherwise the corresponding element from the DataFrameother
is used.The signature for
DataFrame.where()
differs fromnumpy.where()
. Roughlydf1.where(m, df2)
is equivalent tonp.where(m, df1, df2)
.For further details and examples see the
mask
documentation in indexing.Examples
>>> s = pd.Series(range(5)) # doctest: +SKIP >>> s.where(s > 0) # doctest: +SKIP 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> s.mask(s > 0) # doctest: +SKIP 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN
>>> s.where(s > 1, 10) # doctest: +SKIP 0 10.0 1 10.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) # doctest: +SKIP >>> m = df % 3 == 0 # doctest: +SKIP >>> df.where(m, -df) # doctest: +SKIP A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) # doctest: +SKIP A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) # doctest: +SKIP A B 0 True True 1 True True 2 True True 3 True True 4 True True
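The Dask version takes the same cond/other pair. A hedged sketch with an illustrative integer column, replacing values wherever the condition holds:
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'A': [0, 1, 2, 3]}), npartitions=2)
>>> ddf.mask(ddf > 1, -1).compute()  # entries where the condition is True become -1
   A
0  0
1  1
2 -1
3 -1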
-
max
(axis=None, skipna=True, split_every=False, out=None)¶ - This method returns the maximum of the values in the object.
- If you want the index of the maximum, use
idxmax
. This is the equivalent of thenumpy.ndarray
methodargmax
.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max : Series or DataFrame (if level specified)
-
mean
(axis=None, skipna=True, split_every=False, dtype=None, out=None)¶ Return the mean of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: mean : Series or DataFrame (if level specified)
-
memory_usage
(index=True, deep=False)¶ Return the memory usage of each column in bytes.
The memory usage can optionally include the contribution of the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be suppressed by setting
pandas.options.display.memory_usage
to False.Parameters: index : bool, default True
Specifies whether to include the memory usage of the DataFrame’s index in the returned Series. If index=True, the memory usage of the index is the first item in the output.
deep : bool, default False
If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
Returns: sizes : Series
A Series whose index is the original column names and whose values is the memory usage of each column in bytes.
See also
numpy.ndarray.nbytes
- Total bytes consumed by the elements of an ndarray.
Series.memory_usage
- Bytes consumed by a Series.
pandas.Categorical
- Memory-efficient array for string values with many repeated values.
DataFrame.info
- Concise summary of a DataFrame.
Examples
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool'] # doctest: +SKIP >>> data = dict([(t, np.ones(shape=5000).astype(t)) # doctest: +SKIP ... for t in dtypes]) >>> df = pd.DataFrame(data) # doctest: +SKIP >>> df.head() # doctest: +SKIP int64 float64 complex128 object bool 0 1 1.0 (1+0j) 1 True 1 1 1.0 (1+0j) 1 True 2 1 1.0 (1+0j) 1 True 3 1 1.0 (1+0j) 1 True 4 1 1.0 (1+0j) 1 True
>>> df.memory_usage() # doctest: +SKIP Index 80 int64 40000 float64 40000 complex128 80000 object 40000 bool 5000 dtype: int64
>>> df.memory_usage(index=False) # doctest: +SKIP int64 40000 float64 40000 complex128 80000 object 40000 bool 5000 dtype: int64
The memory footprint of object dtype columns is ignored by default:
>>> df.memory_usage(deep=True) # doctest: +SKIP Index 80 int64 40000 float64 40000 complex128 80000 object 160000 bool 5000 dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True) # doctest: +SKIP 5168
-
merge
(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, suffixes=('_x', '_y'), indicator=False, npartitions=None, shuffle=None)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters: right : DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
on : label or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : boolean, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index
sort : boolean, default False
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword)
suffixes : 2-length sequence (tuple, list, …)
Suffix to apply to overlapping column names in the left and right side, respectively
copy : boolean, default True
If False, do not copy data unnecessarily
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
validate : string, default None
If specified, checks if merge is of specified type.
- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
- “many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 0.21.0.
Returns: merged : DataFrame
The output type will be the same as ‘left’, if it is a subclass of DataFrame.
See also
merge_ordered, merge_asof, DataFrame.join
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0
Examples
>>> A >>> B # doctest: +SKIP lkey value rkey value 0 foo 1 0 foo 5 1 bar 2 1 bar 6 2 baz 3 2 qux 7 3 foo 4 3 bar 8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer') # doctest: +SKIP lkey value_x rkey value_y 0 foo 1 foo 5 1 foo 4 foo 5 2 bar 2 bar 6 3 bar 2 bar 8 4 baz 3 NaN NaN 5 NaN NaN qux 7
-
min
(axis=None, skipna=True, split_every=False, out=None)¶ - This method returns the minimum of the values in the object.
- If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: min : Series or DataFrame (if level specified)
-
mod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to
dataframe % other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
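A minimal sketch, not from the upstream docstring; the column name and values are made up for illustration:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [7, 8, 9]}), npartitions=2)  # doctest: +SKIP
>>> ddf.mod(3).compute()  # same result as (ddf % 3).compute()  # doctest: +SKIP
   x
0  1
1  2
2  0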
-
mul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to
dataframe * other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
ndim
¶ Return dimensionality
-
ne
(other, axis='columns', level=None)¶ Wrapper for flexible comparison methods ne
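Examples
A minimal sketch, not from the upstream docstring; the column name and values are made up for illustration:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=2)  # doctest: +SKIP
>>> ddf.ne(2).compute()  # element-wise "not equal" comparison  # doctest: +SKIP
       x
0   True
1  False
2   True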
-
nlargest
(n=5, columns=None, split_every=None)¶ Return the first n rows ordered by columns in descending order.
Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.
This method is equivalent to
df.sort_values(columns, ascending=False).head(n)
, but more performant.Parameters: n : int
Number of rows to return.
columns : label or list of labels
Column label(s) to order by.
keep : {‘first’, ‘last’}, default ‘first’
Where there are duplicate values:
- first : prioritize the first occurrence(s)
- last : prioritize the last occurrence(s)
Returns: DataFrame
The first n rows ordered by the given columns in descending order.
See also
DataFrame.nsmallest
- Return the first n rows ordered by columns in ascending order.
DataFrame.sort_values
- Sort DataFrame by the values
DataFrame.head
- Return the first n rows without re-ordering.
Notes
This function cannot be used with all column types. For example, when specifying columns with object or category dtypes,
TypeError
is raised.Examples
>>> df = pd.DataFrame({'a': [1, 10, 8, 10, -1], # doctest: +SKIP ... 'b': list('abdce'), ... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]}) >>> df # doctest: +SKIP a b c 0 1 a 1.0 1 10 b 2.0 2 8 d NaN 3 10 c 3.0 4 -1 e 4.0
In the following example, we will use
nlargest
to select the three rows having the largest values in column “a”.>>> df.nlargest(3, 'a') # doctest: +SKIP a b c 1 10 b 2.0 3 10 c 3.0 2 8 d NaN
When using
keep='last'
, ties are resolved in reverse order:>>> df.nlargest(3, 'a', keep='last') # doctest: +SKIP a b c 3 10 c 3.0 1 10 b 2.0 2 8 d NaN
To order by the largest values in column “a” and then “c”, we can specify multiple columns like in the next example.
>>> df.nlargest(3, ['a', 'c']) # doctest: +SKIP a b c 3 10 c 3.0 1 10 b 2.0 2 8 d NaN
Attempting to use
nlargest
on non-numeric dtypes will raise aTypeError
:>>> df.nlargest(3, 'b') # doctest: +SKIP Traceback (most recent call last): TypeError: Column 'b' has dtype object, cannot use method 'nlargest'
-
notnull
()¶ Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings
''
ornumpy.inf
are not considered NA values (unless you setpandas.options.mode.use_inf_as_na = True
). NA values, such as None ornumpy.NaN
, get mapped to False values.Returns: DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
See also
DataFrame.notnull
- alias of notna
DataFrame.isna
- boolean inverse of notna
DataFrame.dropna
- omit axes labels with missing values
notna
- top-level notna
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame({'age': [5, 6, np.NaN], # doctest: +SKIP ... 'born': [pd.NaT, pd.Timestamp('1939-05-27'), ... pd.Timestamp('1940-04-25')], ... 'name': ['Alfred', 'Batman', ''], ... 'toy': [None, 'Batmobile', 'Joker']}) >>> df # doctest: +SKIP age born name toy 0 5.0 NaT Alfred None 1 6.0 1939-05-27 Batman Batmobile 2 NaN 1940-04-25 Joker
>>> df.notna() # doctest: +SKIP age born name toy 0 True False True False 1 True True True True 2 False True True True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN]) # doctest: +SKIP >>> ser # doctest: +SKIP 0 5.0 1 6.0 2 NaN dtype: float64
>>> ser.notna() # doctest: +SKIP 0 True 1 True 2 False dtype: bool
-
npartitions
¶ Return number of partitions
-
nsmallest
(n=5, columns=None, split_every=None)¶ Get the rows of a DataFrame sorted by the n smallest values of columns.
Parameters: n : int
Number of items to retrieve
columns : list or str
Column name or names to order by
keep : {‘first’, ‘last’}, default ‘first’
Where there are duplicate values: -
first
: take the first occurrence. -last
: take the last occurrence.Returns: DataFrame
Examples
>>> df = pd.DataFrame({'a': [1, 10, 8, 11, -1], # doctest: +SKIP ... 'b': list('abdce'), ... 'c': [1.0, 2.0, np.nan, 3.0, 4.0]}) >>> df.nsmallest(3, 'a') # doctest: +SKIP a b c 4 -1 e 4 0 1 a 1 2 8 d NaN
-
nunique_approx
(split_every=None)¶ Approximate number of unique rows.
This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.
Parameters: split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.
Returns: a float representing the approximate number of elements
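Examples
A minimal sketch, not from the upstream docstring; the data is made up, and because the count is approximate the exact return value may vary:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 2, 3]}), npartitions=2)  # doctest: +SKIP
>>> ddf.nunique_approx().compute()  # roughly 3, the number of distinct rows  # doctest: +SKIP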
-
partitions
¶ Slice dataframe by partitions
This allows partitionwise slicing of a Dask DataFrame. You can perform normal NumPy-style slicing, but rather than slicing elements of the array you slice along partitions; for example, df.partitions[:5] produces a new Dask DataFrame of the first five partitions.
Returns: A Dask DataFrame
Examples
>>> df.partitions[0] # doctest: +SKIP >>> df.partitions[:3] # doctest: +SKIP >>> df.partitions[::10] # doctest: +SKIP
-
persist
(**kwargs)¶ Persist this dask collection into memory
This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.
The action of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, such as is the case of the dask.distributed scheduler, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.
This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.
Parameters: scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
**kwargs
Extra keywords to forward to the scheduler function.
Returns: New dask collections backed by in-memory data
See also
dask.base.persist
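Examples
A minimal usage sketch, not from the upstream docstring; ddf stands for any existing Dask DataFrame:
>>> ddf = ddf.persist()  # doctest: +SKIP
>>> ddf.sum().compute()  # later computations reuse the persisted partitions  # doctest: +SKIP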
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
Parameters: func : function
function to apply to the NDFrame.
args
, andkwargs
are passed intofunc
. Alternatively a(callable, data_keyword)
tuple wheredata_keyword
is a string indicating the keyword ofcallable
that expects the NDFrame.args : iterable, optional
positional arguments passed into
func
.kwargs : mapping, optional
a dictionary of keyword arguments passed into
func
.Returns: object : the return type of
func
.Notes
Use
.pipe
when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing>>> f(g(h(df), arg1=a), arg2=b, arg3=c) # doctest: +SKIP
You can write
>>> (df.pipe(h) # doctest: +SKIP ... .pipe(g, arg1=a) ... .pipe(f, arg2=b, arg3=c) ... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose
f
takes its data asarg2
:>>> (df.pipe(h) # doctest: +SKIP ... .pipe(g, arg1=a) ... .pipe((f, 'arg2'), arg1=a, arg3=c) ... )
-
pivot_table
(index=None, columns=None, values=None, aggfunc='mean')¶ Create a spreadsheet-style pivot table as a DataFrame. Target
columns must have category dtype to infer the result’s columns. index, columns, values and aggfunc must all be scalar.
Parameters: values : scalar
column to aggregate
index : scalar
column to be index
columns : scalar
column to be columns
aggfunc : {‘mean’, ‘sum’, ‘count’}, default ‘mean’
Returns: table : DataFrame
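Examples
A minimal sketch, not from the upstream docstring; the column names are made up, and note that the columns column must already be categorical with known categories:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> df = pd.DataFrame({'i': [0, 0, 1, 1],  # doctest: +SKIP
...                    'col': pd.Categorical(['a', 'b', 'a', 'b']),
...                    'val': [1.0, 2.0, 3.0, 4.0]})
>>> ddf = dd.from_pandas(df, npartitions=2)  # doctest: +SKIP
>>> ddf.pivot_table(index='i', columns='col', values='val', aggfunc='sum').compute()  # doctest: +SKIP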
-
pow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to
dataframe ** other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
prod
(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶ Return the product of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than
min_count
non-NA values are present the result will be NA.New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns: prod : Series or DataFrame (if level specified)
Examples
By default, the product of an empty or all-NA Series is
1
>>> pd.Series([]).prod() # doctest: +SKIP 1.0
This can be controlled with the
min_count
parameter>>> pd.Series([]).prod(min_count=1) # doctest: +SKIP nan
Thanks to the
skipna
parameter,min_count
handles all-NA and empty series identically.>>> pd.Series([np.nan]).prod() # doctest: +SKIP 1.0
>>> pd.Series([np.nan]).prod(min_count=1) # doctest: +SKIP nan
-
quantile
(q=0.5, axis=0)¶ Approximate row-wise and precise column-wise quantiles of DataFrame
Parameters: q : list/array of floats, default 0.5 (50%)
Iterable of numbers ranging from 0 to 1 for the desired quantiles
axis : {0, 1, ‘index’, ‘columns’} (default 0)
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
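Examples
A minimal sketch, not from the upstream docstring; column names are made up:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(100), 'y': range(100, 200)}), npartitions=4)  # doctest: +SKIP
>>> ddf.quantile(0.5).compute()  # approximate quantile along the index, one value per column  # doctest: +SKIP
>>> ddf.quantile([0.25, 0.75]).compute()  # several quantiles at once  # doctest: +SKIP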
-
query
(expr, **kwargs)¶ Filter dataframe with complex expression
Blocked version of pd.DataFrame.query
This is like the sequential version except that this will also happen in many threads. This may conflict with numexpr, which will use multiple threads itself. We recommend that you set numexpr to use a single thread:
import numexpr
numexpr.set_nthreads(1)
See also
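Examples
A minimal sketch, not from the upstream docstring; column names and values are made up:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}), npartitions=2)  # doctest: +SKIP
>>> ddf.query('x > 1 and y < 6').compute()  # doctest: +SKIP
   x  y
1  2  5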
-
radd
(other, axis='columns', level=None, fill_value=None)¶ Addition of dataframe and other, element-wise (binary operator radd).
Equivalent to
other + dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
>>> a = pd.DataFrame([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], ... columns=['one']) >>> a one a 1.0 b 1.0 c 1.0 d NaN >>> b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], ... two=[np.nan, 2, np.nan, 2]), ... index=['a', 'b', 'd', 'e']) >>> b one two a 1.0 NaN b NaN 2.0 d 1.0 NaN e NaN 2.0 >>> a.add(b, fill_value=0) one two a 2.0 NaN b 1.0 2.0 c 1.0 NaN d 1.0 NaN e NaN 2.0
-
random_split
(frac, random_state=None)¶ Pseudorandomly split dataframe into different pieces row-wise
Parameters: frac : list
List of floats that should sum to one.
random_state: int or np.random.RandomState
If int, create a new RandomState with this as the seed; otherwise draw from the passed RandomState.
See also
dask.DataFrame.sample
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5]) # doctest: +SKIP
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123) # doctest: +SKIP
-
rdiv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to
other / dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
reduction
(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶ Generic row-wise reductions.
Parameters: chunk : callable
Function to operate on each partition. Should return a
pandas.DataFrame
,pandas.Series
, or a scalar.aggregate : callable, optional
Function to operate on the concatenated result of
chunk
. If not specified, defaults tochunk
. Used to do the final aggregation in a tree reduction.The input to
aggregate
depends on the output ofchunk
. If the output ofchunk
is a:- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a
pandas.DataFrame
,pandas.Series
, or a scalar.combine : callable, optional
Function to operate on intermediate concatenated results of
chunk
in a tree-reduction. If not provided, defaults toaggregate
. The input/output requirements should match that ofaggregate
described above.meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.token : str, optional
The name to use for the output keys.
split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to
aggregate
. Default is 8.chunk_kwargs : dict, optional
Keyword arguments to pass on to
chunk
only.aggregate_kwargs : dict, optional
Keyword arguments to pass on to
aggregate
only.combine_kwargs : dict, optional
Keyword arguments to pass on to
combine
only.kwargs :
All remaining keywords will be passed to
chunk
,combine
, andaggregate
.Examples
>>> import pandas as pd >>> import dask.dataframe as dd >>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)}) >>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(), ... aggregate=lambda x: x.sum()) >>> res.compute() x 50 y 50 dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0): ... return (x >= value).sum() >>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(), ... chunk_kwargs={'value': 25}) >>> res.compute() 25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x): ... return pd.Series({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum()) >>> res.compute() count 50 sum 1225 dtype: int64
Doing the same, but for a DataFrame. Here
chunk
returns a DataFrame, meaning the input toaggregate
is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.>>> def sum_and_count(x): ... return pd.DataFrame({'sum': x.sum(), 'count': x.count()}) >>> res = ddf.reduction(sum_and_count, ... aggregate=lambda x: x.groupby(level=0).sum()) >>> res.compute() count sum x 50 1225 y 50 3725
-
rename
(index=None, columns=None)¶ Alter axes labels.
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
See the user guide for more.
Parameters: mapper, index, columns : dict-like or function, optional
dict-like or functions transformations to apply to that axis’ values. Use either
mapper
andaxis
to specify the axis to target withmapper
, orindex
andcolumns
.axis : int or str, optional
Axis to target with
mapper
. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.copy : boolean, default True
Also copy underlying data
inplace : boolean, default False
Whether to return a new DataFrame. If True then value of copy is ignored.
level : int or level name, default None
In case of a MultiIndex, only rename labels in the specified level.
Returns: renamed : DataFrame
See also
Examples
DataFrame.rename
supports two calling conventions(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) # doctest: +SKIP >>> df.rename(index=str, columns={"A": "a", "B": "c"}) # doctest: +SKIP a c 0 1 4 1 2 5 2 3 6
>>> df.rename(index=str, columns={"A": "a", "C": "c"}) # doctest: +SKIP a B 0 1 4 1 2 5 2 3 6
Using axis-style parameters
>>> df.rename(str.lower, axis='columns') # doctest: +SKIP a b 0 1 4 1 2 5 2 3 6
>>> df.rename({1: 2, 2: 4}, axis='index') # doctest: +SKIP A B 0 1 4 2 2 5 4 3 6
-
repartition
(divisions=None, npartitions=None, freq=None, force=False)¶ Repartition dataframe along new divisions
Parameters: divisions : list, optional
List of partitions to be used. If specified npartitions will be ignored.
npartitions : int, optional
Number of partitions of output. Only used if divisions isn’t specified.
freq : str, pd.Timedelta
A period on which to partition timeseries data like
'7D'
or'12h'
orpd.Timedelta(hours=12)
. Assumes a datetime index.force : bool, default False
Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.
Examples
>>> df = df.repartition(npartitions=10) # doctest: +SKIP >>> df = df.repartition(divisions=[0, 5, 10, 20]) # doctest: +SKIP >>> df = df.repartition(freq='7d') # doctest: +SKIP
-
resample
(rule, closed=None, label=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: rule : string
the offset string or object representing target conversion
axis : int, optional, default 0
closed : {‘right’, ‘left’}
Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label : {‘right’, ‘left’}
Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention : {‘start’, ‘end’, ‘s’, ‘e’}
For PeriodIndex only, controls whether to use the start or end of rule
kind: {‘timestamp’, ‘period’}, optional
Pass ‘timestamp’ to convert the resulting index to a
DateTimeIndex
or ‘period’ to convert it to aPeriodIndex
. By default the input representation is retained.loffset : timedelta
Adjust the resampled time labels
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
on : string, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
level : string or int, optional
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
Returns: Resampler object
See also
groupby
- Group by mapping, function, label, or list of labels.
Notes
See the user guide for more.
To learn more about the offset strings, please see this link.
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T') # doctest: +SKIP >>> series = pd.Series(range(9), index=index) # doctest: +SKIP >>> series # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:01:00 1 2000-01-01 00:02:00 2 2000-01-01 00:03:00 3 2000-01-01 00:04:00 4 2000-01-01 00:05:00 5 2000-01-01 00:06:00 6 2000-01-01 00:07:00 7 2000-01-01 00:08:00 8 Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum() # doctest: +SKIP 2000-01-01 00:00:00 3 2000-01-01 00:03:00 12 2000-01-01 00:06:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket
2000-01-01 00:03:00
contains the value 3, but the summed value in the resampled bucket with the label2000-01-01 00:03:00
does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.>>> series.resample('3T', label='right').sum() # doctest: +SKIP 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] #select first 5 rows # doctest: +SKIP 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1.0 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2.0 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the
NaN
values using thepad
method.>>> series.resample('30S').pad()[0:5] # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the
NaN
values using thebfill
method.>>> series.resample('30S').bfill()[0:5] # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via
apply
>>> def custom_resampler(array_like): # doctest: +SKIP ... return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler) # doctest: +SKIP 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01', # doctest: +SKIP freq='A', periods=2)) >>> s # doctest: +SKIP 2012 1 2013 2 Freq: A-DEC, dtype: int64
Resample by month using ‘start’ convention. Values are assigned to the first month of the period.
>>> s.resample('M', convention='start').asfreq().head() # doctest: +SKIP 2012-01 1.0 2012-02 NaN 2012-03 NaN 2012-04 NaN 2012-05 NaN Freq: M, dtype: float64
Resample by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> s.resample('M', convention='end').asfreq() # doctest: +SKIP 2012-12 1.0 2013-01 NaN 2013-02 NaN 2013-03 NaN 2013-04 NaN 2013-05 NaN 2013-06 NaN 2013-07 NaN 2013-08 NaN 2013-09 NaN 2013-10 NaN 2013-11 NaN 2013-12 2.0 Freq: M, dtype: float64
For DataFrame objects, the keyword
on
can be used to specify the column instead of the index for resampling.>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd']) # doctest: +SKIP >>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T') # doctest: +SKIP >>> df.resample('3T', on='time').sum() # doctest: +SKIP a b c d time 2000-01-01 00:00:00 0 3 6 9 2000-01-01 00:03:00 0 3 6 9 2000-01-01 00:06:00 0 3 6 9
For a DataFrame with MultiIndex, the keyword
level
can be used to specify on level the resampling needs to take place.>>> time = pd.date_range('1/1/2000', periods=5, freq='T') # doctest: +SKIP >>> df2 = pd.DataFrame(data=10*[range(4)], # doctest: +SKIP columns=['a', 'b', 'c', 'd'], index=pd.MultiIndex.from_product([time, [1, 2]]) ) >>> df2.resample('3T', level=0).sum() # doctest: +SKIP a b c d 2000-01-01 00:00:00 0 6 12 18 2000-01-01 00:03:00 0 4 8 12
-
reset_index
(drop=False)¶ Reset the index to the default index.
Note that unlike in
pandas
, the resetdask.dataframe
index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g.index1 = [0, ..., 10], index2 = [0, ...]
). This is due to the inability to statically know the full length of the index.For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
Parameters: drop : boolean, default False
Do not try to insert index into dataframe columns.
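Examples
A minimal sketch, not from the upstream docstring, illustrating how the new index restarts at 0 within each partition (data is made up):
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])  # doctest: +SKIP
>>> ddf = dd.from_pandas(df, npartitions=2)  # doctest: +SKIP
>>> ddf.reset_index(drop=True).compute()  # index restarts at 0 in each of the two partitions  # doctest: +SKIP
    x
0  10
1  20
0  30
1  40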
-
rfloordiv
(other, axis='columns', level=None, fill_value=None)¶ Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to
other // dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
rmod
(other, axis='columns', level=None, fill_value=None)¶ Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to
other % dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
rmul
(other, axis='columns', level=None, fill_value=None)¶ Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to
other * dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)¶ Provides rolling transformations.
Parameters: window : int, str, offset
Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a
DatetimeIndex
Changed in version 0.15.0: Now accepts offsets and string offset aliases
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
center : boolean, default False
Set the labels at the center of the window.
win_type : string, default None
Provide a window type. The recognized window types are identical to pandas.
axis : int, default 0
Returns: a Rolling object on which to call a method to compute a statistic
Notes
The freq argument is not supported.
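Examples
A minimal sketch, not from the upstream docstring; the column name is made up:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)  # doctest: +SKIP
>>> ddf.rolling(window=3).mean().compute()  # 3-row moving average; the first two rows are NaN  # doctest: +SKIP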
-
round
(decimals=0)¶ Round a DataFrame to a variable number of decimal places.
Parameters: decimals : int, dict, Series
Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
Returns: DataFrame object
See also
Examples
>>> df = pd.DataFrame(np.random.random([3, 3]), # doctest: +SKIP ... columns=['A', 'B', 'C'], index=['first', 'second', 'third']) >>> df # doctest: +SKIP A B C first 0.028208 0.992815 0.173891 second 0.038683 0.645646 0.577595 third 0.877076 0.149370 0.491027 >>> df.round(2) # doctest: +SKIP A B C first 0.03 0.99 0.17 second 0.04 0.65 0.58 third 0.88 0.15 0.49 >>> df.round({'A': 1, 'C': 2}) # doctest: +SKIP A B C first 0.0 0.992815 0.17 second 0.0 0.645646 0.58 third 0.9 0.149370 0.49 >>> decimals = pd.Series([1, 0, 2], index=['A', 'B', 'C']) # doctest: +SKIP >>> df.round(decimals) # doctest: +SKIP A B C first 0.0 1 0.17 second 0.0 1 0.58 third 0.9 0 0.49
-
rpow
(other, axis='columns', level=None, fill_value=None)¶ Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to
other ** dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
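A minimal sketch, not from the upstream docstring, showing the reversed operand order (other ** dataframe); the column name and values are made up:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}), npartitions=1)  # doctest: +SKIP
>>> ddf.rpow(2).compute()  # computes 2 ** ddf, i.e. x becomes [2, 4, 8]  # doctest: +SKIP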
-
rsub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to
other - dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
>>> a = pd.DataFrame([2, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], ... columns=['one']) >>> a one a 2.0 b 1.0 c 1.0 d NaN >>> b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], ... two=[3, 2, np.nan, 2]), ... index=['a', 'b', 'd', 'e']) >>> b one two a 1.0 3.0 b NaN 2.0 d 1.0 NaN e NaN 2.0 >>> a.sub(b, fill_value=0) one two a 1.0 -3.0 b 1.0 -2.0 c 1.0 NaN d -1.0 NaN e NaN -2.0
-
rtruediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to
other / dataframe
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
None
-
sample
(n=None, frac=None, replace=False, random_state=None)¶ Random sample of items
Parameters: n : int, optional
Number of items to return. Not supported by Dask; use frac instead.
frac : float, optional
Fraction of axis items to return.
replace : boolean, optional
Sample with or without replacement. Default = False.
random_state : int or
np.random.RandomState
If int, we create a new RandomState with this as the seed; otherwise we draw from the passed RandomState.
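Examples
A minimal sketch, not from the upstream docstring; note that n is not supported, so a fraction is given instead:
>>> sampled = df.sample(frac=0.1, random_state=123)  # doctest: +SKIP
>>> sampled = df.sample(frac=0.5, replace=True, random_state=123)  # doctest: +SKIP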
-
select_dtypes
(include=None, exclude=None)¶ Return a subset of the DataFrame’s columns based on the column dtypes.
Parameters: include, exclude : scalar or list-like
A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied.
Returns: subset : DataFrame
The subset of the frame including the dtypes in
include
and excluding the dtypes inexclude
.Raises: ValueError
- If both of
include
andexclude
are empty - If
include
andexclude
have overlapping elements - If any kind of string dtype is passed in.
Notes
- To select all numeric types, use
np.number
or'number'
- To select strings you must use the
object
dtype, but note that this will return all object dtype columns - See the numpy dtype hierarchy
- To select datetimes, use
np.datetime64
,'datetime'
or'datetime64'
- To select timedeltas, use
np.timedelta64
,'timedelta'
or'timedelta64'
- To select Pandas categorical dtypes, use
'category'
- To select Pandas datetimetz dtypes, use
'datetimetz'
(new in 0.20.0) or'datetime64[ns, tz]'
Examples
>>> df = pd.DataFrame({'a': [1, 2] * 3, # doctest: +SKIP ... 'b': [True, False] * 3, ... 'c': [1.0, 2.0] * 3}) >>> df # doctest: +SKIP a b c 0 1 True 1.0 1 2 False 2.0 2 1 True 1.0 3 2 False 2.0 4 1 True 1.0 5 2 False 2.0
>>> df.select_dtypes(include='bool') # doctest: +SKIP b 0 True 1 False 2 True 3 False 4 True 5 False
>>> df.select_dtypes(include=['float64']) # doctest: +SKIP c 0 1.0 1 2.0 2 1.0 3 2.0 4 1.0 5 2.0
>>> df.select_dtypes(exclude=['int']) # doctest: +SKIP b c 0 True 1.0 1 False 2.0 2 True 1.0 3 False 2.0 4 True 1.0 5 False 2.0
- If both of
-
sem
(axis=None, skipna=None, ddof=1, split_every=False)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sem : Series or DataFrame (if level specified)
-
set_index
(other, drop=True, sorted=False, npartitions=None, divisions=None, **kwargs)¶ Set the DataFrame index (row labels) using an existing column
This realigns the dataset to be sorted by a new column. This can have a significant impact on performance, because joins, groupbys, lookups, etc. are all much faster on that column. However, this performance increase comes with a cost, sorting a parallel dataset requires expensive shuffles. Often we
set_index
once directly after data ingest and filtering and then perform many cheap computations off of the sorted dataset.This function operates exactly like
pandas.set_index
except with different performance costs (it is much more expensive). Under normal operation this function does an initial pass over the index column to compute approximate quantiles to serve as future divisions. It then passes over the data a second time, splitting up each input partition into several pieces and sharing those pieces to all of the output partitions now in sorted order.
In some cases we can alleviate those costs, for example if your dataset is sorted already then we can avoid making many small pieces or if you know good values to split the new index column then we can avoid the initial pass over the data. For example if your new index is a datetime index and your data is already sorted by day then this entire operation can be done for free. You can control these options with the following parameters.
Parameters: df: Dask DataFrame
index: string or Dask Series
npartitions: int, None, or ‘auto’
The ideal number of output partitions. If None use the same as the input. If ‘auto’ then decide by memory use.
shuffle: string, optional
Either
'disk'
for single-node operation or'tasks'
for distributed operation. Will be inferred by your current scheduler.sorted: bool, optional
If the index column is already sorted in increasing order. Defaults to False
divisions: list, optional
Known values on which to separate index values of the partitions. See https://docs.dask.org/en/latest/dataframe-design.html#partitions Defaults to computing this with a single pass over the data. Note that if
sorted=True
, specified divisions are assumed to match the existing partitions in the data. If this is untrue, you should leave divisions empty and callrepartition
afterset_index
.compute: bool
Whether or not to trigger an immediate computation. Defaults to False.
Examples
>>> df2 = df.set_index('x') # doctest: +SKIP >>> df2 = df.set_index(d.x) # doctest: +SKIP >>> df2 = df.set_index(d.timestamp, sorted=True) # doctest: +SKIP
A common case is when we have a datetime column that we know to be sorted and is cleanly divided by day. We can set this index for free by specifying both that the column is pre-sorted and the particular divisions along which it is separated.
>>> import pandas as pd >>> divisions = pd.date_range('2000', '2010', freq='1D') >>> df2 = df.set_index('timestamp', sorted=True, divisions=divisions) # doctest: +SKIP
-
shape
¶ Return a tuple representing the dimensionality of the DataFrame.
The number of rows is a Delayed result. The number of columns is a concrete integer.
Examples
>>> df.shape # doctest: +SKIP (Delayed('int-07f06075-5ecc-4d77-817e-63c69a9188a8'), 2)
-
shift
(periods=1, freq=None, axis=0)¶ Shift index by desired number of periods with an optional time freq
Parameters: periods : int
Number of periods to move, can be positive or negative
freq : DateOffset, timedelta, or time rule string, optional
Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.
axis : {0 or ‘index’, 1 or ‘columns’}
Returns: shifted : DataFrame
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
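Examples
A minimal sketch, not from the upstream docstring; the freq variant assumes a datetime-like index:
>>> df.shift(1)  # shift values down by one row  # doctest: +SKIP
>>> df.shift(2, freq='D')  # shift the index forward by two days, leaving the data unchanged  # doctest: +SKIP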
-
size
¶ Size of the Series or DataFrame as a Delayed object.
Examples
>>> series.size # doctest: +SKIP dd.Scalar<size-ag..., dtype=int64>
-
squeeze
(axis=None)¶ Squeeze length 1 dimensions.
Parameters: axis : None, integer or string axis name, optional
The axis to squeeze if 1-sized.
New in version 0.20.0.
Returns: scalar if 1-sized, else original object
-
std
(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: std : Series or DataFrame (if level specified)
-
sub
(other, axis='columns', level=None, fill_value=None)¶ Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to
dataframe - other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
>>> a = pd.DataFrame([2, 1, 1, np.nan], index=['a', 'b', 'c', 'd'], ... columns=['one']) >>> a one a 2.0 b 1.0 c 1.0 d NaN >>> b = pd.DataFrame(dict(one=[1, np.nan, 1, np.nan], ... two=[3, 2, np.nan, 2]), ... index=['a', 'b', 'd', 'e']) >>> b one two a 1.0 3.0 b NaN 2.0 d 1.0 NaN e NaN 2.0 >>> a.sub(b, fill_value=0) one two a 1.0 -3.0 b 1.0 -2.0 c 1.0 NaN d -1.0 NaN e NaN -2.0
-
sum
(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶ Return the sum of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than
min_count
non-NA values are present the result will be NA.New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns: sum : Series or DataFrame (if level specified)
Examples
By default, the sum of an empty or all-NA Series is
0
.>>> pd.Series([]).sum() # min_count=0 is the default # doctest: +SKIP 0.0
This can be controlled with the
min_count
parameter. For example, if you’d like the sum of an empty series to be NaN, passmin_count=1
.>>> pd.Series([]).sum(min_count=1) # doctest: +SKIP nan
Thanks to the
skipna
parameter,min_count
handles all-NA and empty series identically.>>> pd.Series([np.nan]).sum() # doctest: +SKIP 0.0
>>> pd.Series([np.nan]).sum(min_count=1) # doctest: +SKIP nan
-
tail
(n=5, compute=True)¶ Last n rows of the dataset
Caveat: this only checks the last n rows of the last partition.
-
to_bag
(index=False)¶ Create Dask Bag from a Dask DataFrame
Parameters: index : bool, optional
If True, the elements are tuples of
(index, value)
, otherwise they’re just thevalue
. Default is False.Examples
>>> bag = df.to_bag() # doctest: +SKIP
-
to_csv
(filename, **kwargs)¶ Store Dask DataFrame to CSV files
One filename per partition will be created. You can specify the filenames in a variety of ways.
Use a globstring:
>>> df.to_csv('/path/to/data/export-*.csv')
The * will be replaced by the increasing sequence 0, 1, 2, …
/path/to/data/export-0.csv /path/to/data/export-1.csv
Use a globstring and a
name_function=
keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.>>> from datetime import date, timedelta >>> def name(i): ... return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0) '2015-01-01' >>> name(15) '2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name) # doctest: +SKIP
/path/to/data/export-2015-01-01.csv /path/to/data/export-2015-01-02.csv ...
You can also provide an explicit list of paths:
>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...] >>> df.to_csv(paths)
Parameters: filename : string
Path glob indicating the naming scheme for the output files
name_function : callable, default None
Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions
compression : string or None
String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
sep : character, default ‘,’
Field delimiter for the output file
na_rep : string, default ‘’
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
header_first_partition_only : boolean, default False
If set, only write the header row in the first output file
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default ‘w’
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
line_terminator : string, default ‘\n’
The newline character or character sequence to use in the output file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default ‘”’
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
Write MultiIndex columns as a list of tuples (if True) or in the new, expanded format, where each MultiIndex column is a row in the CSV (if False)
date_format : string, default None
Format string for datetime objects
decimal: string, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data
storage_options: dict
Parameters passed on to the backend filesystem class.
Returns: The names of the files written if they were computed right away.
If not, the delayed tasks associated with writing the files.
-
to_dask_array
(lengths=None)¶ Convert a dask DataFrame to a dask array.
Parameters: lengths : bool or Sequence of ints, optional
How to determine the chunk sizes for the output array. By default, the output array will have unknown chunk lengths along the first axis, which can cause some later operations to fail.
- True : immediately compute the length of each partition
- Sequence : a sequence of integers to use for the chunk sizes on the first axis. These values are not validated for correctness, beyond ensuring that the number of items matches the number of partitions.
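Examples
A minimal sketch, not from the upstream docstring:
>>> x = df.to_dask_array()  # chunk lengths along the first axis are unknown  # doctest: +SKIP
>>> x = df.to_dask_array(lengths=True)  # computes partition lengths so the chunks are known  # doctest: +SKIP
>>> x.chunks  # doctest: +SKIP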
-
to_delayed
(optimize_graph=True)¶ Convert into a list of
dask.delayed
objects, one per partition.Parameters: optimize_graph : bool, optional
If True [default], the graph is optimized before converting into
dask.delayed
objects.See also
Examples
>>> partitions = df.to_delayed() # doctest: +SKIP
-
to_hdf
(path_or_buf, key, mode='a', append=False, **kwargs)¶ Store Dask Dataframe to Hierarchical Data Format (HDF) files
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0 or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
Parameters: path : string
Path to a target filename. May contain a
*
to denote many filenameskey : string
Datapath within the files. May contain a
*
to denote many locationsname_function : function
A function to convert the
*
in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (see examples below)compute : bool
Whether or not to execute immediately. If False then this returns a
dask.Delayed
value.lock : Lock, optional
Lock to use to prevent concurrency issues. By default a
threading.Lock
,multiprocessing.Lock
orSerializableLock
will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.scheduler : string
The scheduler to use, like “threads” or “processes”
**other:
See pandas.to_hdf for more information
Returns: filenames : list
Returned if
compute
is True. List of file names that each partition is saved to.delayed : dask.Delayed
Returned if
compute
is False. Delayed object to executeto_hdf
when computed.See also
Examples
Save Data to a single file
>>> df.to_hdf('output.hdf', '/data') # doctest: +SKIP
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*') # doctest: +SKIP
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data') # doctest: +SKIP
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') # doctest: +SKIP
Specify custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc..
>>> from datetime import date, timedelta >>> base = date(year=2000, month=1, day=1) >>> def name_function(i): ... ''' Convert integer 0 to n to a string ''' ... return base + timedelta(days=i)
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) # doctest: +SKIP
-
to_html
(max_rows=5)¶ Render a DataFrame as an HTML table.
to_html-specific options:
- bold_rows : boolean, default True
- Make the row labels bold in the output
- classes : str or list or tuple, default None
- CSS class(es) to apply to the resulting html table
- escape : boolean, default True
- Convert the characters <, >, and & to HTML-safe sequences.
- max_rows : int, optional
- Maximum number of rows to show before truncating. If None, show all.
- max_cols : int, optional
- Maximum number of columns to show before truncating. If None, show all.
- decimal : string, default ‘.’
Character recognized as decimal separator, e.g. ‘,’ in Europe
New in version 0.18.0.
- border : int
A border=border attribute is included in the opening <table> tag. Default pd.options.html.border.
New in version 0.19.0.
- table_id : str, optional
A css id is included in the opening <table> tag if specified.
New in version 0.23.0.
Parameters: buf : StringIO-like, optional
buffer to write to
columns : sequence, optional
the subset of columns to write; default None writes all columns
col_space : int, optional
the minimum width of each column
header : bool, optional
whether to print column labels, default True
index : bool, optional
whether to print index (row) labels, default True
na_rep : string, optional
string representation of NAN to use, default ‘NaN’
formatters : list or dict of one-parameter functions, optional
formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.
float_format : one-parameter function, optional
formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
sparsify : bool, optional
Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
index_names : bool, optional
Prints the names of the indexes, default True
line_width : int, optional
Width to wrap a line in characters, default no wrap
table_id : str, optional
id for the <table> element created by to_html
New in version 0.23.0.
justify : str, default None
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
- left
- right
- center
- justify
- justify-all
- start
- end
- inherit
- match-parent
- initial
- unset
Returns: formatted : string (or unicode, depending on data and options)
Dask doesn’t support the following argument(s).
- buf
- columns
- col_space
- header
- index
- na_rep
- formatters
- float_format
- sparsify
- index_names
- justify
- bold_rows
- classes
- escape
- max_cols
- show_dimensions
- notebook
- decimal
- border
- table_id
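A minimal, hedged sketch of typical use; the frame ddf below is illustrative and not part of the original docstring:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)  # doctest: +SKIP
>>> html = ddf.to_html(max_rows=5)  # doctest: +SKIP  # an HTML string showing at most max_rows rows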
-
to_json
(filename, *args, **kwargs)¶ See dd.to_json docstring for more information
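A hedged sketch (assuming, as with to_csv, that a * in the filename is expanded to one output file per partition; ddf is an illustrative dask DataFrame):
>>> ddf.to_json('output/records-*.json')  # doctest: +SKIP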
-
to_parquet
(path, *args, **kwargs)¶ Store Dask.dataframe to Parquet files
Parameters: df : dask.dataframe.DataFrame
path : string
Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.
engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’
Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.
compression : string or dict, optional
Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.
write_index : boolean, optional
Whether or not to write the index. Defaults to True if divisions are known.
append : bool, optional
If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
ignore_divisions : bool, optional
If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.
partition_on : list, optional
Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles, there will be no global groupby.
storage_options : dict, optional
Key/value pairs to be passed on to the file-system backend, if any.
compute : bool, optional
If True (default) then the result is computed immediately. If False then a dask.delayed object is returned for future computation.
**kwargs
Extra options to be passed on to the specific backend.
See also
read_parquet
- Read parquet data to dask.dataframe
Notes
Each partition will be written to a separate file.
Examples
>>> df = dd.read_csv(...)  # doctest: +SKIP
>>> to_parquet('/path/to/output/', df, compression='snappy')  # doctest: +SKIP
-
to_records
(index=False)¶ Create Dask Array from a Dask DataFrame
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
See also
dask.dataframe._Frame.values, dask.dataframe.from_dask_array
Examples
>>> df.to_records()  # doctest: +SKIP
dask.array<shape=(nan,), dtype=(numpy.record, [('ind', '<f8'), ('x', 'O'), ('y', '<i8')]), chunksize=(nan,)>
-
to_string
(max_rows=5)¶ Render a DataFrame to a console-friendly tabular output.
Parameters: buf : StringIO-like, optional
buffer to write to
columns : sequence, optional
the subset of columns to write; default None writes all columns
col_space : int, optional
the minimum width of each column
header : bool, optional
Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names
index : bool, optional
whether to print index (row) labels, default True
na_rep : string, optional
string representation of NAN to use, default ‘NaN’
formatters : list or dict of one-parameter functions, optional
formatter functions to apply to columns’ elements by position or name, default None. The result of each function must be a unicode string. List must be of length equal to the number of columns.
float_format : one-parameter function, optional
formatter function to apply to columns’ elements if they are floats, default None. The result of this function must be a unicode string.
sparsify : bool, optional
Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row, default True
index_names : bool, optional
Prints the names of the indexes, default True
line_width : int, optional
Width to wrap a line in characters, default no wrap
table_id : str, optional
id for the <table> element created by to_html
New in version 0.23.0.
justify : str, default None
How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are
- left
- right
- center
- justify
- justify-all
- start
- end
- inherit
- match-parent
- initial
- unset
Returns: formatted : string (or unicode, depending on data and options)
Dask doesn’t support the following argument(s).
- buf
- columns
- col_space
- header
- index
- na_rep
- formatters
- float_format
- sparsify
- index_names
- justify
- line_width
- max_cols
- show_dimensions
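A hedged sketch of typical use, reusing the illustrative ddf from the to_html example above:
>>> print(ddf.to_string(max_rows=5))  # doctest: +SKIP  # plain-text rendering of the leading rows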
-
to_timestamp
(freq=None, how='start', axis=0)¶ Cast to DatetimeIndex of timestamps, at beginning of period
Parameters: freq : string, default frequency of PeriodIndex
Desired frequency
how : {‘s’, ‘e’, ‘start’, ‘end’}
Convention for converting period to timestamp; start of period vs. end
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert (the index by default)
copy : boolean, default True
If false then underlying input data is not copied
Returns: df : DataFrame with DatetimeIndex
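As a hedged illustration of the underlying pandas semantics that Dask applies per partition (the frame below is illustrative, not from the original docstring):
>>> import pandas as pd  # doctest: +SKIP
>>> pdf = pd.DataFrame({'x': [1, 2, 3]},
...                    index=pd.period_range('2000-01', periods=3, freq='M'))  # doctest: +SKIP
>>> pdf.to_timestamp(how='start').index  # timestamps at the start of each period  # doctest: +SKIP
DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01'], dtype='datetime64[ns]', freq='MS')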
-
truediv
(other, axis='columns', level=None, fill_value=None)¶ Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series, DataFrame, or constant
axis : {0, 1, ‘index’, ‘columns’}
For Series input, axis to match Series index on
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
Returns: result : DataFrame
See also
Notes
Mismatched indices will be unioned together
Examples
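The generated docstring carries no example here; a hedged sketch of the semantics (names are illustrative):
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'a': [1.0, 2.0], 'b': [4.0, None]}), npartitions=1)  # doctest: +SKIP
>>> ddf.truediv(2).compute()  # same result as (ddf / 2).compute()  # doctest: +SKIP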
-
values
¶ Return a dask.array of the values of this dataframe
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
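A hedged sketch, assuming an illustrative dask DataFrame ddf:
>>> arr = ddf.values  # doctest: +SKIP  # a dask.array with unknown chunk lengths along axis 0
>>> arr.compute()  # doctest: +SKIP  # materializes a numpy array in memory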
-
var
(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: var : Series or DataFrame (if level specified)
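A hedged sketch, assuming an illustrative numeric dask DataFrame ddf:
>>> ddf.var().compute()  # doctest: +SKIP  # per-column sample variance (ddof=1)
>>> ddf.var(ddof=0).compute()  # doctest: +SKIP  # population variance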
-
visualize
(filename='mydask', format=None, optimize_graph=False, **kwargs)¶ Render the computation of this object’s task graph using graphviz.
Requires
graphviz
to be installed.Parameters: filename : str or None, optional
The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional
Format in which to write output file. Default is ‘png’.
optimize_graph : bool, optional
If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
color: {None, ‘order’}, optional
Options to color nodes. Provide cmap= keyword for additional colormap
**kwargs
Additional keyword arguments to forward to to_graphviz.
Returns: result : IPython.display.Image, IPython.display.SVG, or None
See dask.dot.dot_graph for more information.
See also
dask.base.visualize, dask.dot.dot_graph
Notes
For more information on optimization see here:
https://docs.dask.org/en/latest/optimize.html
Examples
>>> x.visualize(filename='dask.pdf')  # doctest: +SKIP
>>> x.visualize(filename='dask.pdf', color='order')  # doctest: +SKIP
-
where
(cond, other=nan)¶ Return an object of same shape as self whose corresponding entries are from self where cond is True, and otherwise are from other.
Parameters: cond : boolean NDFrame, array-like, or callable
Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as cond.
other : scalar, NDFrame, or callable
Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
errors : str, {‘raise’, ‘ignore’}, default ‘raise’
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Deprecated since version 0.21.0.
Returns: wh : same type as caller
See also
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5))  # doctest: +SKIP
>>> s.where(s > 0)  # doctest: +SKIP
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)  # doctest: +SKIP
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)  # doctest: +SKIP
0    10.0
1    10.0
2     2.0
3     3.0
4     4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  # doctest: +SKIP
>>> m = df % 3 == 0  # doctest: +SKIP
>>> df.where(m, -df)  # doctest: +SKIP
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  # doctest: +SKIP
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  # doctest: +SKIP
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
-
Series Methods¶
-
class
dask.dataframe.
Series
(dsk, name, meta, divisions)¶ Parallel Pandas Series
Do not use this class directly. Instead use functions like dd.read_csv, dd.read_parquet, or dd.from_pandas.
Parameters: dsk: dict
The dask graph to compute this Series
_name: str
The key prefix that specifies which keys in the dask comprise this particular Series
meta: pandas.Series
An empty pandas.Series with names, dtypes, and index matching the expected output.
divisions: tuple of index values
Values along which we partition our blocks on the index
See also
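As the note above says, a dask Series is normally produced by I/O functions rather than by calling this constructor directly; a hedged sketch:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ds = dd.from_pandas(pd.Series([1, 2, 3, 4], name='x'), npartitions=2)  # doctest: +SKIP
>>> ds.npartitions  # doctest: +SKIP
2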
-
abs
()¶ Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns: abs
Series/DataFrame containing the absolute value of each element.
See also
numpy.absolute
- calculate the absolute value element-wise.
Notes
For complex inputs, 1.2 + 1j, the absolute value is \(\sqrt{a^2 + b^2}\).
Examples
Absolute numeric values in a Series.
>>> s = pd.Series([-1.10, 2, -3.33, 4]) # doctest: +SKIP >>> s.abs() # doctest: +SKIP 0 1.10 1 2.00 2 3.33 3 4.00 dtype: float64
Absolute numeric values in a Series with complex numbers.
>>> s = pd.Series([1.2 + 1j]) # doctest: +SKIP >>> s.abs() # doctest: +SKIP 0 1.56205 dtype: float64
Absolute numeric values in a Series with a Timedelta element.
>>> s = pd.Series([pd.Timedelta('1 days')]) # doctest: +SKIP >>> s.abs() # doctest: +SKIP 0 1 days dtype: timedelta64[ns]
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({ # doctest: +SKIP ... 'a': [4, 5, 6, 7], ... 'b': [10, 20, 30, 40], ... 'c': [100, 50, -30, -50] ... }) >>> df # doctest: +SKIP a b c 0 4 10 100 1 5 20 50 2 6 30 -30 3 7 40 -50 >>> df.loc[(df.c - 43).abs().argsort()] # doctest: +SKIP a b c 1 5 20 50 0 4 10 100 2 6 30 -30 3 7 40 -50
-
add
(other, level=None, fill_value=None, axis=0)¶ Addition of series and other, element-wise (binary operator add).
Equivalent to series + other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
-
align
(other, join='outer', axis=None, fill_value=None)¶ Align two objects on their axes with the specified join method for each axis Index
Parameters: other : DataFrame or Series
join : {‘outer’, ‘inner’, ‘left’, ‘right’}, default ‘outer’
axis : allowed axis of the other object, default None
Align on index (0), columns (1), or both (None)
level : int or level name, default None
Broadcast across a level, matching Index values on the passed MultiIndex level
copy : boolean, default True
Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any “compatible” value
method : str, default None
limit : int, default None
fill_axis : {0 or ‘index’}, default 0
Filling axis, method and limit
broadcast_axis : {0 or ‘index’}, default None
Broadcast values along this axis, if aligning two objects of different dimensions
Returns: (left, right) : (Series, type of other)
Aligned objects
-
all
(axis=None, skipna=True, split_every=False, out=None)¶ Return whether all elements are True, potentially over an axis.
Returns True if all elements within a series or along a Dataframe axis are non-zero, not-empty or not-False.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
**kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: all : Series or DataFrame (if level specified)
See also
pandas.Series.all
- Return True if all elements are True
pandas.DataFrame.any
- Return True if one (or more) elements are True
Examples
Series
>>> pd.Series([True, True]).all() # doctest: +SKIP True >>> pd.Series([True, False]).all() # doctest: +SKIP False
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]}) # doctest: +SKIP >>> df # doctest: +SKIP col1 col2 0 True True 1 True False
Default behaviour checks if column-wise values all return True.
>>> df.all() # doctest: +SKIP col1 True col2 False dtype: bool
Specify axis='columns' to check if row-wise values all return True.
>>> df.all(axis='columns')  # doctest: +SKIP
0     True
1    False
dtype: bool
Or axis=None for whether every value is True.
>>> df.all(axis=None)  # doctest: +SKIP
False
-
any
(axis=None, skipna=True, split_every=False, out=None)¶ Return whether any element is True over requested axis.
Unlike DataFrame.all(), this performs an or operation. If any of the values along the specified axis is True, this will return True.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’, None}, default 0
Indicate which axis or axes should be reduced.
- 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
- 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
- None : reduce all axes, return a scalar.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
bool_only : boolean, default None
Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
**kwargs : any, default None
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: any : Series or DataFrame (if level specified)
See also
pandas.DataFrame.all
- Return whether all elements are True.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([True, False]).any() # doctest: +SKIP True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]}) # doctest: +SKIP >>> df # doctest: +SKIP A B C 0 1 0 0 1 2 2 0
>>> df.any() # doctest: +SKIP A True B True C False dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 True 1 1 False 2
>>> df.any(axis='columns') # doctest: +SKIP 0 True 1 True dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 True 1 1 False 0
>>> df.any(axis='columns') # doctest: +SKIP 0 True 1 False dtype: bool
Aggregating over the entire DataFrame with axis=None.
>>> df.any(axis=None)  # doctest: +SKIP
True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any() # doctest: +SKIP Series([], dtype: bool)
-
append
(other)¶ Concatenate two or more Series.
Parameters: to_append : Series or list/tuple of Series
ignore_index : boolean, default False
If True, do not use the index labels.
New in version 0.19.0.
verify_integrity : boolean, default False
If True, raise Exception on creating index with duplicates
Returns: appended : Series
See also
pandas.concat
- General function to concatenate DataFrame, Series or Panel objects
Notes
Iteratively appending to a Series can be more computationally intensive than a single concatenate. A better solution is to append values to a list and then concatenate the list with the original Series all at once.
Examples
>>> s1 = pd.Series([1, 2, 3]) # doctest: +SKIP >>> s2 = pd.Series([4, 5, 6]) # doctest: +SKIP >>> s3 = pd.Series([4, 5, 6], index=[3,4,5]) # doctest: +SKIP >>> s1.append(s2) # doctest: +SKIP 0 1 1 2 2 3 0 4 1 5 2 6 dtype: int64
>>> s1.append(s3) # doctest: +SKIP 0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
With ignore_index set to True:
>>> s1.append(s2, ignore_index=True) # doctest: +SKIP 0 1 1 2 2 3 3 4 4 5 5 6 dtype: int64
With verify_integrity set to True:
>>> s1.append(s2, verify_integrity=True) # doctest: +SKIP Traceback (most recent call last): ... ValueError: Indexes have overlapping values: [0, 1, 2]
-
apply
(func, convert_dtype=True, meta='__no_default__', args=(), **kwds)¶ Parallel version of pandas.Series.apply
Parameters: func : function
Function to apply
convert_dtype : boolean, default True
Try to find better dtype for elementwise function results. If False, leave as dtype=object.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
args : tuple
Positional arguments to pass to function in addition to the value.
Additional keyword arguments will be passed as keywords to the function.
Returns: applied : Series or DataFrame if func returns a Series.
See also
dask.Series.map_partitions
Examples
>>> import dask.dataframe as dd
>>> s = pd.Series(range(5), name='x')
>>> ds = dd.from_pandas(s, npartitions=2)
Apply a function elementwise across the Series, passing in extra arguments in args and kwargs:
>>> def myadd(x, a, b=1):
...     return x + a + b
>>> res = ds.apply(myadd, args=(2,), b=1.5)
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the meta keyword. This can be specified in many forms, for more information see dask.dataframe.utils.make_meta.
Here we specify the output is a Series with name 'x', and dtype float64:
>>> res = ds.apply(myadd, args=(2,), b=1.5, meta=('x', 'f8'))
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ds.apply(lambda x: x + 1, meta=ds)
-
astype
(dtype)¶ Cast a pandas object to a specified dtype dtype.
Parameters: dtype : data type, or dict of column name -> data type
Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
copy : bool, default True.
Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
errors : {‘raise’, ‘ignore’}, default ‘raise’.
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object
New in version 0.20.0.
raise_on_error : raise on invalid input
Deprecated since version 0.20.0: Use errors instead
kwargs : keyword arguments to pass on to the constructor
Returns: casted : type of caller
See also
pandas.to_datetime
- Convert argument to datetime.
pandas.to_timedelta
- Convert argument to timedelta.
pandas.to_numeric
- Convert argument to a numeric type.
numpy.ndarray.astype
- Cast a numpy array to a specified type.
Examples
>>> ser = pd.Series([1, 2], dtype='int32') # doctest: +SKIP >>> ser # doctest: +SKIP 0 1 1 2 dtype: int32 >>> ser.astype('int64') # doctest: +SKIP 0 1 1 2 dtype: int64
Convert to categorical type:
>>> ser.astype('category') # doctest: +SKIP 0 1 1 2 dtype: category Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> ser.astype('category', ordered=True, categories=[2, 1]) # doctest: +SKIP 0 1 1 2 dtype: category Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1,2])  # doctest: +SKIP
>>> s2 = s1.astype('int64', copy=False)  # doctest: +SKIP
>>> s2[0] = 10  # doctest: +SKIP
>>> s1  # note that s1[0] has changed too  # doctest: +SKIP
0    10
1     2
dtype: int64
-
autocorr
(lag=1, split_every=False)¶ Lag-N autocorrelation
Parameters: lag : int, default 1
Number of lags to apply before performing autocorrelation.
Returns: autocorr : float
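A hedged sketch (the series ds is illustrative; the result is lazy until computed):
>>> ds = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 4.0]), npartitions=2)  # doctest: +SKIP
>>> ds.autocorr(lag=1).compute()  # doctest: +SKIP  # a float in [-1, 1]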
-
between
(left, right, inclusive=True)¶ Return boolean Series equivalent to left <= series <= right.
This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.
Parameters: left : scalar
Left boundary.
right : scalar
Right boundary.
inclusive : bool, default True
Include boundaries.
Returns: Series
Each element will be a boolean.
See also
pandas.Series.gt
- Greater than of series and other
pandas.Series.lt
- Less than of series and other
Notes
This function is equivalent to
(left <= ser) & (ser <= right)
Examples
>>> s = pd.Series([2, 0, 4, 8, np.nan]) # doctest: +SKIP
Boundary values are included by default:
>>> s.between(1, 4) # doctest: +SKIP 0 True 1 False 2 True 3 False 4 False dtype: bool
With inclusive set to False boundary values are excluded:
>>> s.between(1, 4, inclusive=False)  # doctest: +SKIP
0     True
1    False
2    False
3    False
4    False
dtype: bool
left and right can be any scalar value:
>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve']) # doctest: +SKIP >>> s.between('Anna', 'Daniel') # doctest: +SKIP 0 False 1 True 2 True 3 False dtype: bool
-
bfill
(axis=None, limit=None)¶ Synonym for
DataFrame.fillna(method='bfill')
-
clear_divisions
()¶ Forget division information
-
clip
(lower=None, upper=None, out=None)¶ Trim values at input threshold(s).
Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.
Parameters: lower : float or array_like, default None
Minimum threshold value. All values below this threshold will be set to it.
upper : float or array_like, default None
Maximum threshold value. All values above this threshold will be set to it.
axis : int or string axis name, optional
Align object with lower and upper along the given axis.
inplace : boolean, default False
Whether to perform the operation in place on the data.
New in version 0.21.0.
*args, **kwargs
Additional keywords have no effect but might be accepted for compatibility with numpy.
Returns: Series or DataFrame
Same type as calling object with the values outside the clip boundaries replaced
See also
clip_lower
- Clip values below specified threshold(s).
clip_upper
- Clip values above specified threshold(s).
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]} # doctest: +SKIP >>> df = pd.DataFrame(data) # doctest: +SKIP >>> df # doctest: +SKIP col_0 col_1 0 9 -2 1 -3 -7 2 0 6 3 -1 8 4 5 -5
Clips per column using lower and upper thresholds:
>>> df.clip(-4, 6) # doctest: +SKIP col_0 col_1 0 6 -2 1 -3 -4 2 0 6 3 -1 6 4 5 -4
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3]) # doctest: +SKIP >>> t # doctest: +SKIP 0 2 1 -4 2 -1 3 6 4 3 dtype: int64
>>> df.clip(t, t + 4, axis=0) # doctest: +SKIP col_0 col_1 0 6 2 1 -3 -4 2 0 3 3 6 8 4 5 3
-
clip_lower
(threshold)¶ Return copy of the input with values below a threshold truncated.
Parameters: threshold : numeric or array-like
Minimum value allowed. All values below threshold will be set to this value.
- float : every value is compared to threshold.
- array-like : The shape of threshold should match the object it’s compared to. When self is a Series, threshold should be the same length. When self is a DataFrame, threshold should be 2-D and the same shape as self for axis=None, or 1-D and the same length as the axis being compared.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Align self with threshold along the given axis.
inplace : boolean, default False
Whether to perform the operation in place on the data.
New in version 0.21.0.
Returns: clipped : same type as input
See also
Series.clip
- Return copy of input with values below and above thresholds truncated.
Series.clip_upper
- Return copy of input with values above threshold truncated.
Examples
Series single threshold clipping:
>>> s = pd.Series([5, 6, 7, 8, 9]) # doctest: +SKIP >>> s.clip_lower(8) # doctest: +SKIP 0 8 1 8 2 8 3 8 4 9 dtype: int64
Series clipping element-wise using an array of thresholds. threshold should be the same length as the Series.
>>> elemwise_thresholds = [4, 8, 7, 2, 5] # doctest: +SKIP >>> s.clip_lower(elemwise_thresholds) # doctest: +SKIP 0 5 1 8 2 7 3 8 4 9 dtype: int64
DataFrames can be compared to a scalar.
>>> df = pd.DataFrame({"A": [1, 3, 5], "B": [2, 4, 6]}) # doctest: +SKIP >>> df # doctest: +SKIP A B 0 1 2 1 3 4 2 5 6
>>> df.clip_lower(3) # doctest: +SKIP A B 0 3 3 1 3 4 2 5 6
Or to an array of values. By default, threshold should be the same shape as the DataFrame.
>>> df.clip_lower(np.array([[3, 4], [2, 2], [6, 2]])) # doctest: +SKIP A B 0 3 4 1 3 4 2 6 6
Control how threshold is broadcast with axis. In this case threshold should be the same length as the axis specified by axis.
>>> df.clip_lower(np.array([3, 3, 5]), axis='index') # doctest: +SKIP A B 0 3 3 1 3 4 2 5 6
>>> df.clip_lower(np.array([4, 5]), axis='columns') # doctest: +SKIP A B 0 4 5 1 4 5 2 5 6
-
clip_upper
(threshold)¶ Return copy of input with values above given value(s) truncated.
Parameters: threshold : float or array_like
axis : int or string axis name, optional
Align object with threshold along the given axis.
inplace : boolean, default False
Whether to perform the operation in place on the data
New in version 0.21.0.
Returns: clipped : same type as input
See also
-
combine
(other, func, fill_value=None)¶ Perform elementwise binary operation on two Series using given function with optional fill value when an index is missing from one Series or the other
Parameters: other : Series or scalar value
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value
Returns: result : Series
See also
Series.combine_first
- Combine Series values, choosing the calling Series’s values first
Examples
>>> s1 = Series([1, 2]) # doctest: +SKIP >>> s2 = Series([0, 3]) # doctest: +SKIP >>> s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2) # doctest: +SKIP 0 0 1 2 dtype: int64
-
combine_first
(other)¶ Combine Series values, choosing the calling Series’s values first. Result index will be the union of the two indexes
Parameters: other : Series Returns: combined : Series See also
Series.combine
- Perform elementwise operation on two Series using a given function
Examples
>>> s1 = pd.Series([1, np.nan]) # doctest: +SKIP >>> s2 = pd.Series([3, 4]) # doctest: +SKIP >>> s1.combine_first(s2) # doctest: +SKIP 0 1.0 1 4.0 dtype: float64
-
compute
(**kwargs)¶ Compute this dask collection
This turns a lazy Dask collection into its in-memory equivalent. For example a Dask.array turns into a numpy.array() and a Dask.dataframe turns into a Pandas dataframe. The entire dataset must fit into memory before calling this operation.
Parameters: scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs
Extra keywords to forward to the scheduler function.
See also
dask.base.compute
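A hedged sketch, assuming an illustrative dask Series ds:
>>> result = ds.compute()  # doctest: +SKIP  # a pandas.Series held entirely in local memory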
-
copy
()¶ Make a copy of the dataframe
This is strictly a shallow copy of the underlying computational graph. It does not affect the underlying data
-
corr
(other, method='pearson', min_periods=None, split_every=False)¶ Compute correlation with other Series, excluding missing values
Parameters: other : Series
method : {‘pearson’, ‘kendall’, ‘spearman’}
- pearson : standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
min_periods : int, optional
Minimum number of observations needed to have a valid result
Returns: correlation : float
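A hedged sketch (both series are illustrative; the scalar result is lazy until computed):
>>> a = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 4.0]), npartitions=2)  # doctest: +SKIP
>>> b = dd.from_pandas(pd.Series([1.0, 2.0, 3.0, 5.0]), npartitions=2)  # doctest: +SKIP
>>> a.corr(b).compute()  # doctest: +SKIP  # Pearson correlation by default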
-
count
(split_every=False)¶ Return number of non-NA/null observations in the Series
Parameters: level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a smaller Series
Returns: nobs : int or Series (if level specified)
-
cov
(other, min_periods=None, split_every=False)¶ Compute covariance with Series, excluding missing values
Parameters: other : Series
min_periods : int, optional
Minimum number of observations needed to have a valid result
Returns: covariance : float
Normalized by N-1 (unbiased estimator).
-
cummax
(axis=None, skipna=True, out=None)¶ Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cummax : Series or DataFrame
See also
pandas.core.window.Expanding.max
- Similar functionality but ignores NaN values.
DataFrame.max
- Return the maximum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummax() # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 5.0 4 5.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cummax(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to
axis=None
oraxis='index'
.>>> df.cummax() # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 3.0 1.0
To iterate over columns and find the maximum in each row, use
axis=1
>>> df.cummax(axis=1) # doctest: +SKIP A B 0 2.0 2.0 1 3.0 NaN 2 1.0 1.0
-
cummin
(axis=None, skipna=True, out=None)¶ Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cummin : Series or DataFrame
See also
pandas.core.window.Expanding.min
- Similar functionality but ignores NaN values.
DataFrame.min
- Return the minimum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cummin() # doctest: +SKIP 0 2.0 1 NaN 2 2.0 3 -1.0 4 -1.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cummin(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to
axis=None
oraxis='index'
.>>> df.cummin() # doctest: +SKIP A B 0 2.0 1.0 1 2.0 NaN 2 1.0 0.0
To iterate over columns and find the minimum in each row, use
axis=1
>>> df.cummin(axis=1) # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
-
cumprod
(axis=None, skipna=True, dtype=None, out=None)¶ Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cumprod : Series or DataFrame
See also
pandas.core.window.Expanding.prod
- Similar functionality but ignores NaN values.
DataFrame.prod
- Return the product over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumprod() # doctest: +SKIP 0 2.0 1 NaN 2 10.0 3 -10.0 4 -0.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cumprod(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the product in each column. This is equivalent to
axis=None
oraxis='index'
.>>> df.cumprod() # doctest: +SKIP A B 0 2.0 1.0 1 6.0 NaN 2 6.0 0.0
To iterate over columns and find the product in each row, use
axis=1
>>> df.cumprod(axis=1) # doctest: +SKIP A B 0 2.0 2.0 1 3.0 NaN 2 1.0 0.0
-
cumsum
(axis=None, skipna=True, dtype=None, out=None)¶ Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args, **kwargs :
Additional keywords have no effect but might be accepted for compatibility with NumPy.
Returns: cumsum : Series or DataFrame
See also
pandas.core.window.Expanding.sum
- Similar functionality but ignores NaN values.
DataFrame.sum
- Return the sum over DataFrame axis.
DataFrame.cummax
- Return cumulative maximum over DataFrame axis.
DataFrame.cummin
- Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
- Return cumulative sum over DataFrame axis.
DataFrame.cumprod
- Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0]) # doctest: +SKIP >>> s # doctest: +SKIP 0 2.0 1 NaN 2 5.0 3 -1.0 4 0.0 dtype: float64
By default, NA values are ignored.
>>> s.cumsum() # doctest: +SKIP 0 2.0 1 NaN 2 7.0 3 6.0 4 6.0 dtype: float64
To include NA values in the operation, use
skipna=False
>>> s.cumsum(skipna=False) # doctest: +SKIP 0 2.0 1 NaN 2 NaN 3 NaN 4 NaN dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0], # doctest: +SKIP ... [3.0, np.nan], ... [1.0, 0.0]], ... columns=list('AB')) >>> df # doctest: +SKIP A B 0 2.0 1.0 1 3.0 NaN 2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to
axis=None
oraxis='index'
.>>> df.cumsum() # doctest: +SKIP A B 0 2.0 1.0 1 5.0 NaN 2 6.0 1.0
To iterate over columns and find the sum in each row, use
axis=1
>>> df.cumsum(axis=1) # doctest: +SKIP A B 0 2.0 3.0 1 3.0 NaN 2 1.0 1.0
-
describe
(split_every=False, percentiles=None)¶ Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
Parameters: percentiles : list-like of numbers, optional
The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include : ‘all’, list-like of dtypes or None (default), optional
A white list of data types to include in the result. Ignored for Series. Here are the options:
- ‘all’ : All columns of the input will be included in the output.
- A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'
- None (default) : The result will include all numeric columns.
exclude : list-like of dtypes or None (default), optional,
A black list of data types to omit from the result. Ignored for Series. Here are the options:
- A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To exclude pandas categorical columns, use 'category'
- None (default) : The result will exclude nothing.
Returns: summary: Series/DataFrame of summary statistics
See also
DataFrame.count
,DataFrame.max
,DataFrame.min
,DataFrame.mean
,DataFrame.std
,DataFrame.select_dtypes
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
Describing a numeric
Series
.>>> s = pd.Series([1, 2, 3]) # doctest: +SKIP >>> s.describe() # doctest: +SKIP count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Describing a categorical
Series
.>>> s = pd.Series(['a', 'a', 'b', 'c']) # doctest: +SKIP >>> s.describe() # doctest: +SKIP count 4 unique 3 top a freq 2 dtype: object
Describing a timestamp
Series
.>>> s = pd.Series([ # doctest: +SKIP ... np.datetime64("2000-01-01"), ... np.datetime64("2010-01-01"), ... np.datetime64("2010-01-01") ... ]) >>> s.describe() # doctest: +SKIP count 3 unique 2 top 2010-01-01 00:00:00 freq 2 first 2000-01-01 00:00:00 last 2010-01-01 00:00:00 dtype: object
Describing a
DataFrame
. By default only numeric fields are returned.>>> df = pd.DataFrame({ 'object': ['a', 'b', 'c'], # doctest: +SKIP ... 'numeric': [1, 2, 3], ... 'categorical': pd.Categorical(['d','e','f']) ... }) >>> df.describe() # doctest: +SKIP numeric count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Describing all columns of a
DataFrame
regardless of data type.>>> df.describe(include='all') # doctest: +SKIP categorical numeric object count 3 3.0 3 unique 3 NaN 3 top f NaN c freq 1 NaN 1 mean NaN 2.0 NaN std NaN 1.0 NaN min NaN 1.0 NaN 25% NaN 1.5 NaN 50% NaN 2.0 NaN 75% NaN 2.5 NaN max NaN 3.0 NaN
Describing a column from a
DataFrame
by accessing it as an attribute.>>> df.numeric.describe() # doctest: +SKIP count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0 Name: numeric, dtype: float64
Including only numeric columns in a
DataFrame
description.>>> df.describe(include=[np.number]) # doctest: +SKIP numeric count 3.0 mean 2.0 std 1.0 min 1.0 25% 1.5 50% 2.0 75% 2.5 max 3.0
Including only string columns in a
DataFrame
description.>>> df.describe(include=[np.object]) # doctest: +SKIP object count 3 unique 3 top c freq 1
Including only categorical columns from a
DataFrame
description.>>> df.describe(include=['category']) # doctest: +SKIP categorical count 3 unique 3 top f freq 1
Excluding numeric columns from a
DataFrame
description.>>> df.describe(exclude=[np.number]) # doctest: +SKIP categorical object count 3 3 unique 3 3 top f c freq 1 1
Excluding object columns from a
DataFrame
description.>>> df.describe(exclude=[np.object]) # doctest: +SKIP categorical numeric count 3 3.0 unique 3 NaN top f NaN freq 1 NaN mean NaN 2.0 std NaN 1.0 min NaN 1.0 25% NaN 1.5 50% NaN 2.0 75% NaN 2.5 max NaN 3.0
-
diff
(periods=1, axis=0)¶ First discrete difference of element.
Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is the element in the same column of the previous row).
Parameters: periods : int, default 1
Periods to shift for calculating difference, accepts negative values.
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Take difference over rows (0) or columns (1).
New in version 0.16.1.
Returns: diffed : DataFrame
See also
Series.diff
- First discrete difference for a Series.
DataFrame.pct_change
- Percent change over given number of periods.
DataFrame.shift
- Shift index by desired number of periods with an optional time freq.
Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], # doctest: +SKIP ... 'b': [1, 1, 2, 3, 5, 8], ... 'c': [1, 4, 9, 16, 25, 36]}) >>> df # doctest: +SKIP a b c 0 1 1 1 1 2 1 4 2 3 2 9 3 4 3 16 4 5 5 25 5 6 8 36
>>> df.diff() # doctest: +SKIP a b c 0 NaN NaN NaN 1 1.0 0.0 3.0 2 1.0 1.0 5.0 3 1.0 1.0 7.0 4 1.0 2.0 9.0 5 1.0 3.0 11.0
Difference with previous column
>>> df.diff(axis=1) # doctest: +SKIP a b c 0 NaN 0.0 0.0 1 NaN -1.0 3.0 2 NaN -1.0 7.0 3 NaN -1.0 13.0 4 NaN 0.0 20.0 5 NaN 2.0 28.0
Difference with 3rd previous row
>>> df.diff(periods=3) # doctest: +SKIP a b c 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 3.0 2.0 15.0 4 3.0 4.0 21.0 5 3.0 6.0 27.0
Difference with following row
>>> df.diff(periods=-1) # doctest: +SKIP a b c 0 -1.0 0.0 -3.0 1 -1.0 -1.0 -5.0 2 -1.0 -1.0 -7.0 3 -1.0 -2.0 -9.0 4 -1.0 -3.0 -11.0 5 NaN NaN NaN
-
div
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
drop_duplicates
(split_every=None, split_out=1, **kwargs)¶ Return DataFrame with duplicate rows removed, optionally only considering certain columns
Parameters: subset : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by default use all of the columns
keep : {‘first’, ‘last’, False}, default ‘first’
first
: Drop duplicates except for the first occurrence.last
: Drop duplicates except for the last occurrence.- False : Drop all duplicates.
inplace : boolean, default False
Whether to drop duplicates in place or to return a copy
Returns: deduplicated : DataFrame
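A hedged sketch (illustrative series; the order of the computed result may differ from pandas because deduplication happens per partition before results are combined):
>>> ds = dd.from_pandas(pd.Series([1, 1, 2, 2, 3]), npartitions=2)  # doctest: +SKIP
>>> ds.drop_duplicates().compute()  # doctest: +SKIP  # unique values, keeping first occurrences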
-
dropna
()¶ Return a new Series with missing values removed.
See the User Guide for more on which values are considered missing, and how to work with missing data.
Parameters: axis : {0 or ‘index’}, default 0
There is only one axis to drop values from.
inplace : bool, default False
If True, do operation inplace and return None.
**kwargs
Not in use.
Returns: Series
Series with NA entries dropped from it.
See also
Series.isna
- Indicate missing values.
Series.notna
- Indicate existing (non-missing) values.
Series.fillna
- Replace missing values.
DataFrame.dropna
- Drop rows or columns which contain NA values.
Index.dropna
- Drop missing indices.
Examples
>>> ser = pd.Series([1., 2., np.nan]) # doctest: +SKIP >>> ser # doctest: +SKIP 0 1.0 1 2.0 2 NaN dtype: float64
Drop NA values from a Series.
>>> ser.dropna() # doctest: +SKIP 0 1.0 1 2.0 dtype: float64
Keep the Series with valid entries in the same variable.
>>> ser.dropna(inplace=True) # doctest: +SKIP >>> ser # doctest: +SKIP 0 1.0 1 2.0 dtype: float64
Empty strings are not considered NA values. None is considered an NA value.
>>> ser = pd.Series([np.NaN, 2, pd.NaT, '', None, 'I stay'])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()  # doctest: +SKIP
1         2
3
5    I stay
dtype: object
-
dt
¶ Namespace of datetime methods
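A hedged sketch of the accessor (illustrative datetime series):
>>> ds = dd.from_pandas(pd.Series(pd.date_range('2000-01-01', periods=4)), npartitions=2)  # doctest: +SKIP
>>> ds.dt.year.compute()  # doctest: +SKIP  # same datetime namespace as pandas, evaluated lazily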
-
dtype
¶ Return data type
-
eq
(other, level=None, axis=0)¶ Equal to of series and other, element-wise (binary operator eq).
Equivalent to series == other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
ffill
(axis=None, limit=None)¶ Synonym for
DataFrame.fillna(method='ffill')
-
fillna
(value=None, method=None, limit=None, axis=None)¶ Fill NA/NaN values using the specified method
Parameters: value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). (values not in the dict/Series/DataFrame will not be filled). This value cannot be a list.
method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
Method to use for filling holes in reindexed Series pad / ffill: propagate last valid observation forward to next valid backfill / bfill: use NEXT valid observation to fill gap
axis : {0 or ‘index’, 1 or ‘columns’}
inplace : boolean, default False
If True, fill in place. Note: this will modify any other views on this object, (e.g. a no-copy slice for a column in a DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
downcast : dict, default is None
a dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible)
Returns: filled : DataFrame
See also
interpolate
- Fill NaN values using interpolation.
reindex
,asfreq
Examples
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0], # doctest: +SKIP ... [3, 4, np.nan, 1], ... [np.nan, np.nan, np.nan, 5], ... [np.nan, 3, np.nan, 4]], ... columns=list('ABCD')) >>> df # doctest: +SKIP A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 NaN NaN NaN 5 3 NaN 3.0 NaN 4
Replace all NaN elements with 0s.
>>> df.fillna(0) # doctest: +SKIP A B C D 0 0.0 2.0 0.0 0 1 3.0 4.0 0.0 1 2 0.0 0.0 0.0 5 3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward.
>>> df.fillna(method='ffill') # doctest: +SKIP A B C D 0 NaN 2.0 NaN 0 1 3.0 4.0 NaN 1 2 3.0 4.0 NaN 5 3 3.0 3.0 NaN 4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3} # doctest: +SKIP >>> df.fillna(value=values) # doctest: +SKIP A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 2.0 1 2 0.0 1.0 2.0 5 3 0.0 3.0 2.0 4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1) # doctest: +SKIP A B C D 0 0.0 2.0 2.0 0 1 3.0 4.0 NaN 1 2 NaN 1.0 NaN 5 3 NaN 3.0 NaN 4
-
first
(offset)¶ Convenience method for subsetting initial periods of time series data based on a date offset.
Parameters: offset : string, DateOffset, dateutil.relativedelta
Returns: subset : type of caller
Raises: TypeError
If the index is not a
DatetimeIndex
See also
last
- Select final periods of time series based on a date offset
at_time
- Select values at a particular time of the day
between_time
- Select values between particular times of the day
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  # doctest: +SKIP
>>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)  # doctest: +SKIP
>>> ts  # doctest: +SKIP
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
Get the rows for the first 3 days:
>>> ts.first('3D')  # doctest: +SKIP
            A
2018-04-09  1
2018-04-11  2
Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore the data for 2018-04-13 were not returned.
-
floordiv
(other, level=None, fill_value=None, axis=0)¶ Integer division of series and other, element-wise (binary operator floordiv).
Equivalent to
series // other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
ge
(other, level=None, axis=0)¶ Greater than or equal to of series and other, element-wise (binary operator ge).
Equivalent to
series >= other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
get_partition
(n)¶ Get a dask DataFrame/Series representing the nth partition.
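A small illustrative sketch (the data below is hypothetical): with six rows split into three partitions, partition 1 holds the middle two rows.
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6)}), npartitions=3)  # doctest: +SKIP
>>> ddf.get_partition(1).compute()  # the second partition as its own dask object  # doctest: +SKIP
   x
2  2
3  3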
-
groupby
(by=None, **kwargs)¶ Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.
Parameters: by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it is called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see the .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Note that a tuple is interpreted as a (single) key.
axis : int, default 0
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : boolean, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
squeeze : boolean, default False
reduce the dimensionality of the return type if possible, otherwise return a consistent type
observed : boolean, default False
This only applies if any of the groupers are Categoricals If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
New in version 0.23.0.
Returns: GroupBy object
See also
resample
- Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
DataFrame results
>>> data.groupby(func, axis=0).mean() # doctest: +SKIP >>> data.groupby(['col1', 'col2'])['col3'].mean() # doctest: +SKIP
DataFrame with hierarchical index
>>> data.groupby(['col1', 'col2']).mean() # doctest: +SKIP
-
gt
(other, level=None, axis=0)¶ Greater than of series and other, element-wise (binary operator gt).
Equivalent to
series > other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
head
(n=5, npartitions=1, compute=True)¶ First n rows of the dataset
Parameters: n : int, optional
The number of rows to return. Default is 5.
npartitions : int, optional
Elements are only taken from the first
npartitions
, with a default of 1. If there are fewer thann
rows in the firstnpartitions
a warning will be raised and any found rows returned. Pass -1 to use all partitions.compute : bool, optional
Whether to compute the result, default is True.
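A brief sketch of the three parameters (the ddf below is a hypothetical dask DataFrame; outputs omitted):
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=2)  # doctest: +SKIP
>>> ddf.head(3)                  # looks only in the first partition  # doctest: +SKIP
>>> ddf.head(3, npartitions=2)   # searches the first two partitions  # doctest: +SKIP
>>> ddf.head(3, compute=False)   # returns a lazy dask DataFrame instead of pandas  # doctest: +SKIP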
-
idxmax
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of maximum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: idxmax : Series
Raises: ValueError
- If the row/column is empty
See also
Notes
This method is the DataFrame version of
ndarray.argmax
.
-
idxmin
(axis=None, skipna=True, split_every=False)¶ Return index of first occurrence of minimum over requested axis. NA/null values are excluded.
Parameters: axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA.
Returns: idxmin : Series
Raises: ValueError
- If the row/column is empty
See also
Notes
This method is the DataFrame version of
ndarray.argmin
.
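An illustrative sketch of both idxmax and idxmin (the small frame here is hypothetical):
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> df = pd.DataFrame({'a': [1, 5, 3], 'b': [9, 2, 4]})  # doctest: +SKIP
>>> ddf = dd.from_pandas(df, npartitions=2)  # doctest: +SKIP
>>> ddf.idxmax().compute()  # index label of each column's maximum  # doctest: +SKIP
a    1
b    0
dtype: int64
>>> ddf.idxmin().compute()  # index label of each column's minimum  # doctest: +SKIP
a    0
b    1
dtype: int64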
-
index
¶ Return dask Index instance
-
isin
(values)¶ Check whether values are contained in Series.
Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.
Parameters: values : set or list-like
The sequence of values to test. Passing in a single string will raise a
TypeError
. Instead, turn a single string into a list of one element.New in version 0.18.1: Support for values as a set.
Returns: isin : Series (bool dtype)
Raises: TypeError
- If values is a string
See also
pandas.DataFrame.isin
- equivalent method on DataFrame
Examples
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',  # doctest: +SKIP
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])  # doctest: +SKIP
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool
Passing a single string as
s.isin('lama')
will raise an error. Use a list of one element instead:
>>> s.isin(['lama'])  # doctest: +SKIP
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool
-
isna
()¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also
DataFrame.isnull
- alias of isna
DataFrame.notna
- boolean inverse of isna
DataFrame.dropna
- omit axes labels with missing values
isna
- top-level isna
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  # doctest: +SKIP
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  # doctest: +SKIP
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  # doctest: +SKIP
0    False
1    False
2     True
dtype: bool
-
isnull
()¶ Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns: DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
See also
DataFrame.isnull
- alias of isna
DataFrame.notna
- boolean inverse of isna
DataFrame.dropna
- omit axes labels with missing values
isna
- top-level isna
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  # doctest: +SKIP
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()  # doctest: +SKIP
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()  # doctest: +SKIP
0    False
1    False
2     True
dtype: bool
-
iteritems
()¶ Lazily iterate over (index, value) tuples
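A minimal sketch (hypothetical series); partitions are pulled into memory one at a time as the iterator advances:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> s = dd.from_pandas(pd.Series([10, 20, 30]), npartitions=2)  # doctest: +SKIP
>>> for idx, value in s.iteritems():  # doctest: +SKIP
...     print(idx, value)
0 10
1 20
2 30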
-
known_divisions
¶ Whether divisions are already known
-
last
(offset)¶ Convenience method for subsetting final periods of time series data based on a date offset.
Parameters: offset : string, DateOffset, dateutil.relativedelta
Returns: subset : type of caller
Raises: TypeError
If the index is not a
DatetimeIndex
See also
first
- Select initial periods of time series based on a date offset
at_time
- Select values at a particular time of the day
between_time
- Select values between particular times of the day
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')  # doctest: +SKIP
>>> ts = pd.DataFrame({'A': [1,2,3,4]}, index=i)  # doctest: +SKIP
>>> ts  # doctest: +SKIP
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
Get the rows for the last 3 days:
>>> ts.last('3D')  # doctest: +SKIP
            A
2018-04-13  3
2018-04-15  4
Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore the data for 2018-04-11 were not returned.
-
le
(other, level=None, axis=0)¶ Less than or equal to of series and other, element-wise (binary operator le).
Equivalent to
series <= other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
loc
¶ Purely label-location based indexer for selection by label.
>>> df.loc["b"] # doctest: +SKIP >>> df.loc["b":"d"] # doctest: +SKIP
-
lt
(other, level=None, axis=0)¶ Less than of series and other, element-wise (binary operator lt).
Equivalent to
series < other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
map
(arg, na_action=None, meta='__no_default__')¶ Map values of Series using input correspondence (a dict, Series, or function).
Parameters: arg : function, dict, or Series
Mapping correspondence.
na_action : {None, ‘ignore’}
If ‘ignore’, propagate NA values, without passing them to the mapping correspondence.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.Returns: y : Series
Same index as caller.
See also
Series.apply
- For applying more complex functions on a Series.
DataFrame.apply
- Apply a function row-/column-wise.
DataFrame.applymap
- Apply a function elementwise on a whole DataFrame.
Notes
When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to
NaN
. However, if the dictionary is adict
subclass that defines__missing__
(i.e. provides a method for default values), then this default is used rather thanNaN
:
>>> from collections import Counter  # doctest: +SKIP
>>> counter = Counter()  # doctest: +SKIP
>>> counter['bar'] += 1  # doctest: +SKIP
>>> y.map(counter)  # doctest: +SKIP
1    0
2    1
3    0
dtype: int64
Examples
Map inputs to outputs (both of type Series):
>>> x = pd.Series([1,2,3], index=['one', 'two', 'three'])  # doctest: +SKIP
>>> x  # doctest: +SKIP
one      1
two      2
three    3
dtype: int64
>>> y = pd.Series(['foo', 'bar', 'baz'], index=[1,2,3])  # doctest: +SKIP
>>> y  # doctest: +SKIP
1    foo
2    bar
3    baz
>>> x.map(y)  # doctest: +SKIP
one      foo
two      bar
three    baz
If arg is a dictionary, return a new Series with values converted according to the dictionary’s mapping:
>>> z = {1: 'A', 2: 'B', 3: 'C'} # doctest: +SKIP
>>> x.map(z)  # doctest: +SKIP
one      A
two      B
three    C
Use na_action to control whether NA values are affected by the mapping function.
>>> s = pd.Series([1, 2, 3, np.nan]) # doctest: +SKIP
>>> s2 = s.map('this is a string {}'.format, na_action=None)  # doctest: +SKIP
0    this is a string 1.0
1    this is a string 2.0
2    this is a string 3.0
3    this is a string nan
dtype: object
>>> s3 = s.map('this is a string {}'.format, na_action='ignore')  # doctest: +SKIP
0    this is a string 1.0
1    this is a string 2.0
2    this is a string 3.0
3                     NaN
dtype: object
-
map_overlap
(func, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
This can be useful for implementing windowing functions such as
df.rolling(...).mean()
ordf.diff()
.Parameters: func : function
Function applied to each partition.
before : int
The number of rows to prepend to partition
i
from the end of partitioni - 1
.after : int
The number of rows to append to partition
i
from the beginning of partitioni + 1
.args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.Notes
Given positive integers
before
andafter
, and a functionfunc
,map_overlap
does the following:- Prepend
before
rows to each partitioni
from the end of partitioni - 1
. The first partition has no rows prepended. - Append
after
rows to each partitioni
from the beginning of partitioni + 1
. The last partition has no rows appended. - Apply
func
to each partition, passing in any extraargs
andkwargs
if provided. - Trim
before
rows from the beginning of all but the first partition. - Trim
after
rows from the end of all but the last partition.
Note that the index and divisions are assumed to remain unchanged.
Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 4, 7, 11],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)
A rolling sum with a trailing moving window of size 2 can be computed by overlapping 2 rows before each partition, and then mapping calls to
df.rolling(2).sum()
:
>>> ddf.compute()
    x    y
0   1  1.0
1   2  2.0
2   4  3.0
3   7  4.0
4  11  5.0
>>> ddf.map_overlap(lambda df: df.rolling(2).sum(), 2, 0).compute()
      x    y
0   NaN  NaN
1   3.0  3.0
2   6.0  5.0
3  11.0  7.0
4  18.0  9.0
The pandas
diff
method computes a discrete difference shifted by a number of periods (can be positive or negative). This can be implemented by mapping calls todf.diff
to each partition after prepending/appending that many rows, depending on sign:
>>> def diff(df, periods=1):
...     before, after = (periods, 0) if periods > 0 else (0, -periods)
...     return df.map_overlap(lambda df, periods=1: df.diff(periods),
...                           before, after, periods=periods)
>>> diff(ddf, 1).compute()
     x    y
0  NaN  NaN
1  1.0  1.0
2  2.0  1.0
3  3.0  1.0
4  4.0  1.0
If you have a
DatetimeIndex
, you can use apd.Timedelta
for time-based windows.
>>> ts = pd.Series(range(10), index=pd.date_range('2017', periods=10))
>>> dts = dd.from_pandas(ts, npartitions=2)
>>> dts.map_overlap(lambda df: df.rolling('2D').sum(),
...                 pd.Timedelta('2D'), 0).compute()
2017-01-01     0.0
2017-01-02     1.0
2017-01-03     3.0
2017-01-04     5.0
2017-01-05     7.0
2017-01-06     9.0
2017-01-07    11.0
2017-01-08    13.0
2017-01-09    15.0
2017-01-10    17.0
dtype: float64
-
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Note that the index and divisions are assumed to remain unchanged.
Parameters: func : function
Function applied to each partition.
args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after. Arguments and keywords may contain
Scalar
,Delayed
or regular python objects.meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.Examples
Given a DataFrame, Series, or Index, such as:
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
...                    'y': [1., 2., 3., 4., 5.]})
>>> ddf = dd.from_pandas(df, npartitions=2)
One can use
map_partitions
to apply a function on each partition. Extra arguments and keywords can optionally be provided, and will be passed to the function after the partition.Here we apply a function with arguments and keywords to a DataFrame, resulting in a Series:
>>> def myadd(df, a, b=1):
...     return df.x + df.y + a + b
>>> res = ddf.map_partitions(myadd, 1, b=2)
>>> res.dtype
dtype('float64')
By default, dask tries to infer the output metadata by running your provided function on some fake data. This works well in many cases, but can sometimes be expensive, or even fail. To avoid this, you can manually specify the output metadata with the
meta
keyword. This can be specified in many forms, for more information seedask.dataframe.utils.make_meta
.Here we specify the output is a Series with no name, and dtype
float64
:>>> res = ddf.map_partitions(myadd, 1, b=2, meta=(None, 'f8'))
Here we map a function that takes in a DataFrame, and returns a DataFrame with a new column:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
>>> res.dtypes
x      int64
y    float64
z    float64
dtype: object
As before, the output metadata can also be specified manually. This time we pass in a
dict
, as the output is a DataFrame:
>>> res = ddf.map_partitions(lambda df: df.assign(z=df.x * df.y),
...                          meta={'x': 'i8', 'y': 'f8', 'z': 'f8'})
In the case where the metadata doesn’t change, you can also pass in the object itself directly:
>>> res = ddf.map_partitions(lambda df: df.head(), meta=df)
Also note that the index and divisions are assumed to remain unchanged. If the function you’re mapping changes the index/divisions, you’ll need to clear them afterwards:
>>> ddf.map_partitions(func).clear_divisions() # doctest: +SKIP
-
mask
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is False and otherwise are from other.
Parameters: cond : boolean NDFrame, array-like, or callable
Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as cond.
other : scalar, NDFrame, or callable
Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
errors : str, {‘raise’, ‘ignore’}, default ‘raise’
raise
: allow exceptions to be raisedignore
: suppress exceptions. On error return original object
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Deprecated since version 0.21.0.
Returns: wh : same type as caller
See also
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond
isFalse
the element is used; otherwise the corresponding element from the DataFrameother
is used.The signature for
DataFrame.where()
differs fromnumpy.where()
. Roughlydf1.where(m, df2)
is equivalent tonp.where(m, df1, df2)
.For further details and examples see the
mask
documentation in indexing.Examples
>>> s = pd.Series(range(5))  # doctest: +SKIP
>>> s.where(s > 0)  # doctest: +SKIP
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
>>> s.mask(s > 0)  # doctest: +SKIP
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
>>> s.where(s > 1, 10)  # doctest: +SKIP
0    10.0
1    10.0
2     2.0
3     3.0
4     4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  # doctest: +SKIP
>>> m = df % 3 == 0  # doctest: +SKIP
>>> df.where(m, -df)  # doctest: +SKIP
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)  # doctest: +SKIP
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)  # doctest: +SKIP
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
-
max
(axis=None, skipna=True, split_every=False, out=None)¶ This method returns the maximum of the values in the object. If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: max : Series or DataFrame (if level specified)
-
mean
(axis=None, skipna=True, split_every=False, dtype=None, out=None)¶ Return the mean of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: mean : Series or DataFrame (if level specified)
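A small sketch of the reduction (the frame below is hypothetical; the max and min reductions documented nearby follow the same pattern):
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10., 20., 30., 40.]})  # doctest: +SKIP
>>> ddf = dd.from_pandas(df, npartitions=2)  # doctest: +SKIP
>>> ddf.mean().compute()  # column-wise mean across all partitions  # doctest: +SKIP
x     2.5
y    25.0
dtype: float64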
-
memory_usage
(index=True, deep=False)¶ Return the memory usage of the Series.
The memory usage can optionally include the contribution of the index and of elements of object dtype.
Parameters: index : bool, default True
Specifies whether to include the memory usage of the Series index.
deep : bool, default False
If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned value.
Returns: int
Bytes of memory consumed.
See also
numpy.ndarray.nbytes
- Total bytes consumed by the elements of the array.
DataFrame.memory_usage
- Bytes consumed by a DataFrame.
Examples
>>> s = pd.Series(range(3)) # doctest: +SKIP >>> s.memory_usage() # doctest: +SKIP 104
Not including the index gives the size of the rest of the data, which is necessarily smaller:
>>> s.memory_usage(index=False) # doctest: +SKIP 24
The memory footprint of object values is ignored by default:
>>> s = pd.Series(["a", "b"]) # doctest: +SKIP >>> s.values # doctest: +SKIP array(['a', 'b'], dtype=object) >>> s.memory_usage() # doctest: +SKIP 96 >>> s.memory_usage(deep=True) # doctest: +SKIP 212
-
min
(axis=None, skipna=True, split_every=False, out=None)¶ This method returns the minimum of the values in the object. If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: min : Series or DataFrame (if level specified)
-
mod
(other, level=None, fill_value=None, axis=0)¶ Modulo of series and other, element-wise (binary operator mod).
Equivalent to
series % other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
mul
(other, level=None, fill_value=None, axis=0)¶ Multiplication of series and other, element-wise (binary operator mul).
Equivalent to
series * other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
nbytes
¶ Number of bytes
-
ndim
¶ Return dimensionality
-
ne
(other, level=None, axis=0)¶ Not equal to of series and other, element-wise (binary operator ne).
Equivalent to
series != other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
nlargest
(n=5, split_every=None)¶ Return the largest n elements.
Parameters: n : int
Return this many descending sorted values
keep : {‘first’, ‘last’}, default ‘first’
Where there are duplicate values: -
first
: take the first occurrence. -last
: take the last occurrence.Returns: top_n : Series
The n largest values in the Series, in sorted order
See also
Notes
Faster than
.sort_values(ascending=False).head(n)
for small n relative to the size of theSeries
object.Examples
>>> import pandas as pd  # doctest: +SKIP
>>> import numpy as np  # doctest: +SKIP
>>> s = pd.Series(np.random.randn(10**6))  # doctest: +SKIP
>>> s.nlargest(10)  # only sorts up to the N requested  # doctest: +SKIP
219921    4.644710
82124     4.608745
421689    4.564644
425277    4.447014
718691    4.414137
43154     4.403520
283187    4.313922
595519    4.273635
503969    4.250236
121637    4.240952
dtype: float64
-
notnull
()¶ Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns: DataFrame
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
See also
DataFrame.notnull
- alias of notna
DataFrame.isna
- boolean inverse of notna
DataFrame.dropna
- omit axes labels with missing values
notna
- top-level notna
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame({'age': [5, 6, np.NaN],  # doctest: +SKIP
...                    'born': [pd.NaT, pd.Timestamp('1939-05-27'),
...                             pd.Timestamp('1940-04-25')],
...                    'name': ['Alfred', 'Batman', ''],
...                    'toy': [None, 'Batmobile', 'Joker']})
>>> df  # doctest: +SKIP
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()  # doctest: +SKIP
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN])  # doctest: +SKIP
>>> ser  # doctest: +SKIP
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()  # doctest: +SKIP
0     True
1     True
2    False
dtype: bool
-
npartitions
¶ Return number of partitions
-
nsmallest
(n=5, split_every=None)¶ Return the smallest n elements.
Parameters: n : int
Return this many ascending sorted values
keep : {‘first’, ‘last’}, default ‘first’
Where there are duplicate values: -
first
: take the first occurrence. -last
: take the last occurrence.Returns: bottom_n : Series
The n smallest values in the Series, in sorted order
See also
Notes
Faster than
.sort_values().head(n)
for small n relative to the size of theSeries
object.Examples
>>> import pandas as pd  # doctest: +SKIP
>>> import numpy as np  # doctest: +SKIP
>>> s = pd.Series(np.random.randn(10**6))  # doctest: +SKIP
>>> s.nsmallest(10)  # only sorts up to the N requested  # doctest: +SKIP
288532   -4.954580
732345   -4.835960
64803    -4.812550
446457   -4.609998
501225   -4.483945
669476   -4.472935
973615   -4.401699
621279   -4.355126
773916   -4.347355
359919   -4.331927
dtype: float64
-
nunique
(split_every=None)¶ Return number of unique elements in the object.
Excludes NA values by default.
Parameters: dropna : boolean, default True
Don’t include NaN in the count.
Returns: nunique : int
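A minimal sketch (hypothetical series):
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> s = dd.from_pandas(pd.Series([1, 2, 2, 3, 3, 3]), npartitions=2)  # doctest: +SKIP
>>> s.nunique().compute()  # exact count of distinct values  # doctest: +SKIP
3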
-
nunique_approx
(split_every=None)¶ Approximate number of unique rows.
This method uses the HyperLogLog algorithm for cardinality estimation to compute the approximate number of unique rows. The approximate error is 0.406%.
Parameters: split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used. Default is 8.
Returns: a float representing the approximate number of elements
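A minimal sketch (hypothetical frame); because the result is a HyperLogLog estimate returned as a float, it may differ slightly from the exact count:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> df = pd.DataFrame({'x': [1, 1, 2, 3, 3], 'y': list('aabbc')})  # doctest: +SKIP
>>> ddf = dd.from_pandas(df, npartitions=2)  # doctest: +SKIP
>>> ddf.nunique_approx().compute()  # approximately 4 unique rows  # doctest: +SKIP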
-
partitions
¶ Slice dataframe by partitions
This allows partitionwise slicing of a Dask DataFrame. You can perform normal NumPy-style slicing, but rather than selecting elements of the array, you select along partitions, so that, for example,
df.partitions[:5]
produces a new Dask Dataframe of the first five partitions.Returns: A Dask DataFrame Examples
>>> df.partitions[0] # doctest: +SKIP >>> df.partitions[:3] # doctest: +SKIP >>> df.partitions[::10] # doctest: +SKIP
-
persist
(**kwargs)¶ Persist this dask collection into memory
This turns a lazy Dask collection into a Dask collection with the same metadata, but now with the results fully computed or actively computing in the background.
The action of this function differs significantly depending on the active task scheduler. If the task scheduler supports asynchronous computing, as is the case with the dask.distributed scheduler, then persist will return immediately and the return value’s task graph will contain Dask Future objects. However, if the task scheduler only supports blocking computation, then the call to persist will block and the return value’s task graph will contain concrete Python results.
This function is particularly useful when using distributed systems, because the results will be kept in distributed memory, rather than returned to the local process as with compute.
Parameters: scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the graph is optimized before computation. Otherwise the graph is run as is. This can be useful for debugging.
**kwargs
Extra keywords to forward to the scheduler function.
Returns: New dask collections backed by in-memory data
See also
dask.base.persist
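A typical usage sketch (the file pattern and column name below are hypothetical):
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> ddf = dd.read_csv('data-*.csv')   # lazy collection  # doctest: +SKIP
>>> ddf = ddf.persist()               # compute and keep the results in (distributed) memory  # doctest: +SKIP
>>> ddf.x.sum().compute()             # later operations reuse the persisted partitions  # doctest: +SKIP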
-
pipe
(func, *args, **kwargs)¶ Apply func(self, *args, **kwargs)
Parameters: func : function
function to apply to the NDFrame.
args
, andkwargs
are passed intofunc
. Alternatively a(callable, data_keyword)
tuple wheredata_keyword
is a string indicating the keyword ofcallable
that expects the NDFrame.args : iterable, optional
positional arguments passed into
func
.kwargs : mapping, optional
a dictionary of keyword arguments passed into
func
.Returns: object : the return type of
func
.Notes
Use
.pipe
when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead of writing>>> f(g(h(df), arg1=a), arg2=b, arg3=c) # doctest: +SKIP
You can write
>>> (df.pipe(h)  # doctest: +SKIP
...    .pipe(g, arg1=a)
...    .pipe(f, arg2=b, arg3=c)
...  )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose
f
takes its data asarg2
:
>>> (df.pipe(h)  # doctest: +SKIP
...    .pipe(g, arg1=a)
...    .pipe((f, 'arg2'), arg1=a, arg3=c)
...  )
-
pow
(other, level=None, fill_value=None, axis=0)¶ Exponential power of series and other, element-wise (binary operator pow).
Equivalent to
series ** other
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
prod
(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶ Return the product of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than
min_count
non-NA values are present the result will be NA.New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns: prod : Series or DataFrame (if level specified)
Examples
By default, the product of an empty or all-NA Series is
1
>>> pd.Series([]).prod() # doctest: +SKIP 1.0
This can be controlled with the
min_count
parameter>>> pd.Series([]).prod(min_count=1) # doctest: +SKIP nan
Thanks to the
skipna
parameter,min_count
handles all-NA and empty series identically.>>> pd.Series([np.nan]).prod() # doctest: +SKIP 1.0
>>> pd.Series([np.nan]).prod(min_count=1) # doctest: +SKIP nan
-
quantile
(q=0.5)¶ Approximate quantiles of Series
Parameters: q : list/array of floats, default 0.5 (50%)
Iterable of numbers ranging from 0 to 1 for the desired quantiles
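A minimal sketch (hypothetical series); because the quantiles are approximate, results may differ slightly from pandas:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> s = dd.from_pandas(pd.Series(range(100)), npartitions=4)  # doctest: +SKIP
>>> s.quantile(0.5).compute()                # single quantile  # doctest: +SKIP
>>> s.quantile([0.25, 0.5, 0.75]).compute()  # Series of quantiles  # doctest: +SKIP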
-
radd
(other, level=None, fill_value=None, axis=0)¶ Addition of series and other, element-wise (binary operator radd).
Equivalent to
other + series
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
random_split
(frac, random_state=None)¶ Pseudorandomly split dataframe into different pieces row-wise
Parameters: frac : list
List of floats that should sum to one.
random_state: int or np.random.RandomState
If int create a new RandomState with this as the seed
Otherwise draw from the passed RandomState
See also
dask.DataFrame.sample
Examples
50/50 split
>>> a, b = df.random_split([0.5, 0.5]) # doctest: +SKIP
80/10/10 split, consistent random_state
>>> a, b, c = df.random_split([0.8, 0.1, 0.1], random_state=123) # doctest: +SKIP
-
rdiv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to
other / series
, but with support to substitute a fill_value for missing data in one of the inputs.Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
reduction
(chunk, aggregate=None, combine=None, meta='__no_default__', token=None, split_every=None, chunk_kwargs=None, aggregate_kwargs=None, combine_kwargs=None, **kwargs)¶ Generic row-wise reductions.
Parameters: chunk : callable
Function to operate on each partition. Should return a
pandas.DataFrame
,pandas.Series
, or a scalar.aggregate : callable, optional
Function to operate on the concatenated result of
chunk
. If not specified, defaults tochunk
. Used to do the final aggregation in a tree reduction.The input to
aggregate
depends on the output ofchunk
. If the output ofchunk
is a:- scalar: Input is a Series, with one row per partition.
- Series: Input is a DataFrame, with one row per partition. Columns are the rows in the output series.
- DataFrame: Input is a DataFrame, with one row per partition. Columns are the columns in the output dataframes.
Should return a
pandas.DataFrame
,pandas.Series
, or a scalar.combine : callable, optional
Function to operate on intermediate concatenated results of
chunk
in a tree-reduction. If not provided, defaults toaggregate
. The input/output requirements should match that ofaggregate
described above.meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty
pd.DataFrame
orpd.Series
that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of aDataFrame
, adict
of{name: dtype}
or iterable of(name, dtype)
can be provided. Instead of a series, a tuple of(name, dtype)
can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providingmeta
is recommended. For more information, seedask.dataframe.utils.make_meta
.token : str, optional
The name to use for the output keys.
split_every : int, optional
Group partitions into groups of this size while performing a tree-reduction. If set to False, no tree-reduction will be used, and all intermediates will be concatenated and passed to
aggregate
. Default is 8.chunk_kwargs : dict, optional
Keyword arguments to pass on to
chunk
only.aggregate_kwargs : dict, optional
Keyword arguments to pass on to
aggregate
only.combine_kwargs : dict, optional
Keyword arguments to pass on to
combine
only.kwargs :
All remaining keywords will be passed to
chunk
,combine
, andaggregate
.Examples
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> df = pd.DataFrame({'x': range(50), 'y': range(50, 100)})
>>> ddf = dd.from_pandas(df, npartitions=4)
Count the number of rows in a DataFrame. To do this, count the number of rows in each partition, then sum the results:
>>> res = ddf.reduction(lambda x: x.count(),
...                     aggregate=lambda x: x.sum())
>>> res.compute()
x    50
y    50
dtype: int64
Count the number of rows in a Series with elements greater than or equal to a value (provided via a keyword).
>>> def count_greater(x, value=0):
...     return (x >= value).sum()
>>> res = ddf.x.reduction(count_greater, aggregate=lambda x: x.sum(),
...                       chunk_kwargs={'value': 25})
>>> res.compute()
25
Aggregate both the sum and count of a Series at the same time:
>>> def sum_and_count(x):
...     return pd.Series({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.x.reduction(sum_and_count, aggregate=lambda x: x.sum())
>>> res.compute()
count      50
sum      1225
dtype: int64
Doing the same, but for a DataFrame. Here
chunk
returns a DataFrame, meaning the input toaggregate
is a DataFrame with an index with non-unique entries for both ‘x’ and ‘y’. We groupby the index, and sum each group to get the final result.
>>> def sum_and_count(x):
...     return pd.DataFrame({'sum': x.sum(), 'count': x.count()})
>>> res = ddf.reduction(sum_and_count,
...                     aggregate=lambda x: x.groupby(level=0).sum())
>>> res.compute()
   count   sum
x     50  1225
y     50  3725
-
rename
(index=None, inplace=False, sorted_index=False)¶ Alter Series index labels or name
Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.
Alternatively, change
Series.name
with a scalar value.Parameters: index : scalar, hashable sequence, dict-like or callable, optional
If dict-like or callable, the transformation is applied to the index. Scalar or hashable sequence-like will alter the
Series.name
attribute.inplace : boolean, default False
Whether to return a new Series or modify this one inplace.
sorted_index : bool, default False
If true, the output
Series
will have known divisions inferred from the input series and the transformation. Ignored for non-callable/dict-likeindex
or when the input series has unknown divisions. Note that this may only be set toTrue
if you know that the transformed index is monotonically increasing. Dask will check that transformed divisions are monotonic, but cannot check all the values between divisions, so incorrectly setting this can result in bugs. Returns: renamed : Series
See also
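A minimal sketch (hypothetical series), renaming index labels with a callable:
>>> import pandas as pd  # doctest: +SKIP
>>> import dask.dataframe as dd  # doctest: +SKIP
>>> s = dd.from_pandas(pd.Series([1, 2, 3], index=[1, 2, 3]), npartitions=2)  # doctest: +SKIP
>>> s.rename(lambda x: x * 10).compute()  # doctest: +SKIP
10    1
20    2
30    3
dtype: int64
>>> s.rename(lambda x: x * 10, sorted_index=True)  # keep known divisions, since x*10 is monotonic  # doctest: +SKIP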
-
repartition
(divisions=None, npartitions=None, freq=None, force=False)¶ Repartition dataframe along new divisions
Parameters: divisions : list, optional
List of partitions to be used. If specified npartitions will be ignored.
npartitions : int, optional
Number of partitions of output. Only used if divisions isn’t specified.
freq : str, pd.Timedelta
A period on which to partition timeseries data like
'7D'
or'12h'
orpd.Timedelta(hours=12)
. Assumes a datetime index.force : bool, default False
Allows the expansion of the existing divisions. If False then the new divisions lower and upper bounds must be the same as the old divisions.
Examples
>>> df = df.repartition(npartitions=10)  # doctest: +SKIP
>>> df = df.repartition(divisions=[0, 5, 10, 20])  # doctest: +SKIP
>>> df = df.repartition(freq='7d')  # doctest: +SKIP
-
resample
(rule, closed=None, label=None)¶ Convenience method for frequency conversion and resampling of time series. Object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or pass datetime-like values to the on or level keyword.
Parameters: rule : string
the offset string or object representing target conversion
axis : int, optional, default 0
closed : {‘right’, ‘left’}
Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label : {‘right’, ‘left’}
Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention : {‘start’, ‘end’, ‘s’, ‘e’}
For PeriodIndex only, controls whether to use the start or end of rule
kind: {‘timestamp’, ‘period’}, optional
Pass ‘timestamp’ to convert the resulting index to a
DateTimeIndex
or ‘period’ to convert it to aPeriodIndex
. By default the input representation is retained.loffset : timedelta
Adjust the resampled time labels
base : int, default 0
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0
on : string, optional
For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
New in version 0.19.0.
level : string or int, optional
For a MultiIndex, level (name or number) to use for resampling. Level must be datetime-like.
New in version 0.19.0.
Returns: Resampler object
See also
groupby
- Group by mapping, function, label, or list of labels.
Notes
See the user guide for more.
To learn more about the offset strings, please see this link.
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')  # doctest: +SKIP
>>> series = pd.Series(range(9), index=index)  # doctest: +SKIP
>>> series  # doctest: +SKIP
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum()  # doctest: +SKIP
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket
2000-01-01 00:03:00
contains the value 3, but the summed value in the resampled bucket with the label2000-01-01 00:03:00
does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.>>> series.resample('3T', label='right').sum() # doctest: +SKIP 2000-01-01 00:03:00 3 2000-01-01 00:06:00 12 2000-01-01 00:09:00 21 Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum() # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:03:00 6 2000-01-01 00:06:00 15 2000-01-01 00:09:00 15 Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5] #select first 5 rows # doctest: +SKIP 2000-01-01 00:00:00 0.0 2000-01-01 00:00:30 NaN 2000-01-01 00:01:00 1.0 2000-01-01 00:01:30 NaN 2000-01-01 00:02:00 2.0 Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the
NaN
values using thepad
method.>>> series.resample('30S').pad()[0:5] # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:00:30 0 2000-01-01 00:01:00 1 2000-01-01 00:01:30 1 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the
NaN
values using thebfill
method.>>> series.resample('30S').bfill()[0:5] # doctest: +SKIP 2000-01-01 00:00:00 0 2000-01-01 00:00:30 1 2000-01-01 00:01:00 1 2000-01-01 00:01:30 2 2000-01-01 00:02:00 2 Freq: 30S, dtype: int64
Pass a custom function via
apply
>>> def custom_resampler(array_like): # doctest: +SKIP ... return np.sum(array_like)+5
>>> series.resample('3T').apply(custom_resampler) # doctest: +SKIP 2000-01-01 00:00:00 8 2000-01-01 00:03:00 17 2000-01-01 00:06:00 26 Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',  # doctest: +SKIP
...                                             freq='A', periods=2))
>>> s  # doctest: +SKIP
2012    1
2013    2
Freq: A-DEC, dtype: int64
Resample by month using ‘start’ convention. Values are assigned to the first month of the period.
>>> s.resample('M', convention='start').asfreq().head() # doctest: +SKIP 2012-01 1.0 2012-02 NaN 2012-03 NaN 2012-04 NaN 2012-05 NaN Freq: M, dtype: float64
Resample by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> s.resample('M', convention='end').asfreq() # doctest: +SKIP 2012-12 1.0 2013-01 NaN 2013-02 NaN 2013-03 NaN 2013-04 NaN 2013-05 NaN 2013-06 NaN 2013-07 NaN 2013-08 NaN 2013-09 NaN 2013-10 NaN 2013-11 NaN 2013-12 2.0 Freq: M, dtype: float64
For DataFrame objects, the keyword
on
can be used to specify the column instead of the index for resampling.
>>> df = pd.DataFrame(data=9*[range(4)], columns=['a', 'b', 'c', 'd'])  # doctest: +SKIP
>>> df['time'] = pd.date_range('1/1/2000', periods=9, freq='T')  # doctest: +SKIP
>>> df.resample('3T', on='time').sum()  # doctest: +SKIP
                     a  b  c  d
time
2000-01-01 00:00:00  0  3  6  9
2000-01-01 00:03:00  0  3  6  9
2000-01-01 00:06:00  0  3  6  9
For a DataFrame with MultiIndex, the keyword
level
can be used to specify on which level the resampling needs to take place.
>>> time = pd.date_range('1/1/2000', periods=5, freq='T')  # doctest: +SKIP
>>> df2 = pd.DataFrame(data=10*[range(4)],  # doctest: +SKIP
...                    columns=['a', 'b', 'c', 'd'],
...                    index=pd.MultiIndex.from_product([time, [1, 2]]))
>>> df2.resample('3T', level=0).sum()  # doctest: +SKIP
                     a  b   c   d
2000-01-01 00:00:00  0  6  12  18
2000-01-01 00:03:00  0  4   8  12
-
reset_index
(drop=False)¶ Reset the index to the default index.
Note that unlike in
pandas
, the resetdask.dataframe
index will not be monotonically increasing from 0. Instead, it will restart at 0 for each partition (e.g.index1 = [0, ..., 10], index2 = [0, ...]
). This is due to the inability to statically know the full length of the index.For DataFrame with multi-level index, returns a new DataFrame with labeling information in the columns under the index names, defaulting to ‘level_0’, ‘level_1’, etc. if any are None. For a standard index, the index name will be used (if set), otherwise a default ‘index’ or ‘level_0’ (if ‘index’ is already taken) will be used.
Parameters: drop : boolean, default False
Do not try to insert index into dataframe columns.
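As a brief, hedged sketch (ddf here stands for any existing dask DataFrame; the name is illustrative), the per-partition restart described above shows up after computing:
>>> ddf2 = ddf.reset_index() # doctest: +SKIP
>>> ddf2.compute().index # values such as 0, 1, ... repeat within each partition # doctest: +SKIP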
-
rfloordiv
(other, level=None, fill_value=None, axis=0)¶ Integer division of series and other, element-wise (binary operator rfloordiv).
Equivalent to other // series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
rmod
(other, level=None, fill_value=None, axis=0)¶ Modulo of series and other, element-wise (binary operator rmod).
Equivalent to other % series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
rmul
(other, level=None, fill_value=None, axis=0)¶ Multiplication of series and other, element-wise (binary operator rmul).
Equivalent to other * series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
rolling
(window, min_periods=None, freq=None, center=False, win_type=None, axis=0)¶ Provides rolling transformations.
Parameters: window : int, str, offset
Size of the moving window. This is the number of observations used for calculating the statistic. The window size must not be so large as to span more than one adjacent partition. If using an offset or offset alias like ‘5D’, the data must have a DatetimeIndex.
Changed in version 0.15.0: Now accepts offsets and string offset aliases
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA).
center : boolean, default False
Set the labels at the center of the window.
win_type : string, default None
Provide a window type. The recognized window types are identical to pandas.
axis : int, default 0
Returns: a Rolling object on which to call a method to compute a statistic
Notes
The freq argument is not supported.
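For instance, a hedged sketch (assuming ddf is a dask DataFrame with a numeric column x and, for the offset form, a DatetimeIndex):
>>> ddf.x.rolling(window=3).mean() # fixed-size window of 3 observations # doctest: +SKIP
>>> ddf.x.rolling(window='5D').sum() # offset-based window; requires a DatetimeIndex # doctest: +SKIP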
-
round
(decimals=0)¶ Round each value in a Series to the given number of decimals.
Parameters: decimals : int
Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.
Returns: Series object
See also
-
rpow
(other, level=None, fill_value=None, axis=0)¶ Exponential power of series and other, element-wise (binary operator rpow).
Equivalent to other ** series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
rsub
(other, level=None, fill_value=None, axis=0)¶ Subtraction of series and other, element-wise (binary operator rsub).
Equivalent to other - series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
rtruediv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator rtruediv).
Equivalent to other / series, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
sample
(n=None, frac=None, replace=False, random_state=None)¶ Random sample of items
Parameters: n : int, optional
Number of items to return. Not supported by dask; use frac instead.
frac : float, optional
Fraction of axis items to return.
replace : boolean, optional
Sample with or without replacement. Default = False.
random_state : int or np.random.RandomState
If an int, a new RandomState is created with this as the seed; otherwise values are drawn from the passed RandomState.
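For example (a hedged sketch; df is assumed to be an existing dask DataFrame):
>>> df.sample(frac=0.05, random_state=42) # roughly 5% of rows, reproducibly # doctest: +SKIP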
-
sem
(axis=None, skipna=None, ddof=1, split_every=False)¶ Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: sem : Series or DataFrame (if level specified)
-
shape
¶ Return a tuple representing the dimensionality of a Series.
The single element of the tuple is a Delayed result.
Examples
>>> series.shape # doctest: +SKIP # (dd.Scalar<size-ag..., dtype=int64>,)
-
shift
(periods=1, freq=None, axis=0)¶ Shift index by desired number of periods with an optional time freq
Parameters: periods : int
Number of periods to move, can be positive or negative
freq : DateOffset, timedelta, or time rule string, optional
Increment to use from the tseries module or time rule (e.g. ‘EOM’). See Notes.
axis : {0 or ‘index’, 1 or ‘columns’}
Returns: shifted : DataFrame
Notes
If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
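As a hedged illustration (assuming ddf has a DatetimeIndex; the period counts are arbitrary):
>>> ddf.shift(1) # shift the data down by one row # doctest: +SKIP
>>> ddf.shift(2, freq='D') # shift the index forward two days; the data is not realigned # doctest: +SKIP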
-
size
¶ Size of the Series or DataFrame as a Delayed object.
Examples
>>> series.size # doctest: +SKIP dd.Scalar<size-ag..., dtype=int64>
-
squeeze
()¶ Squeeze length 1 dimensions.
Parameters: axis : None, integer or string axis name, optional
The axis to squeeze if 1-sized.
New in version 0.20.0.
Returns: scalar if 1-sized, else original object
-
std
(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶ Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: std : Series or DataFrame (if level specified)
-
str
¶ Namespace for string methods
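For example (an illustrative sketch; s is assumed to be a dask Series of strings):
>>> s.str.lower() # doctest: +SKIP
>>> s.str.contains('abc') # doctest: +SKIP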
-
sub
(other, level=None, fill_value=None, axis=0)¶ Subtraction of series and other, element-wise (binary operator sub).
Equivalent to series - other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
sum
(axis=None, skipna=True, split_every=False, dtype=None, out=None, min_count=None)¶ Return the sum of the values for the requested axis
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values when computing the result.
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
Returns: sum : Series or DataFrame (if level specified)
Examples
By default, the sum of an empty or all-NA Series is 0.
>>> pd.Series([]).sum() # min_count=0 is the default # doctest: +SKIP 0.0
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
>>> pd.Series([]).sum(min_count=1) # doctest: +SKIP nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum() # doctest: +SKIP 0.0
>>> pd.Series([np.nan]).sum(min_count=1) # doctest: +SKIP nan
-
tail
(n=5, compute=True)¶ Last n rows of the dataset
Caveat: this only checks the last n rows of the last partition.
-
to_bag
(index=False)¶ Create a Dask Bag from a Series
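A hedged usage sketch (s is assumed to be an existing dask Series):
>>> b = s.to_bag() # doctest: +SKIP
>>> b.take(3) # doctest: +SKIP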
-
to_csv
(filename, **kwargs)¶ Store Dask DataFrame to CSV files
One filename per partition will be created. You can specify the filenames in a variety of ways.
Use a globstring:
>>> df.to_csv('/path/to/data/export-*.csv')
The * will be replaced by the increasing sequence 0, 1, 2, …
/path/to/data/export-0.csv /path/to/data/export-1.csv
Use a globstring and a name_function= keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.
>>> from datetime import date, timedelta >>> def name(i): ... return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0) '2015-01-01' >>> name(15) '2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name) # doctest: +SKIP
/path/to/data/export-2015-01-01.csv /path/to/data/export-2015-01-02.csv ...
You can also provide an explicit list of paths:
>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...] >>> df.to_csv(paths)
Parameters: filename : string
Path glob indicating the naming scheme for the output files
name_function : callable, default None
Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions
compression : string or None
String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
sep : character, default ‘,’
Field delimiter for the output file
na_rep : string, default ‘’
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of string is given it is assumed to be aliases for the column names
header_first_partition_only : boolean, default False
If set, only write the header row in the first output file
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default ‘w’
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
line_terminator : string, default ‘\n’
The newline character or character sequence to use in the output file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default ‘”’
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)
date_format : string, default None
Format string for datetime objects
decimal: string, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data
storage_options: dict
Parameters passed on to the backend filesystem class.
Returns: The names of the files written, if they were computed right away
If not, the delayed tasks associated with writing the files
-
to_dask_array
(lengths=None)¶ Convert a dask DataFrame to a dask array.
Parameters: lengths : bool or Sequence of ints, optional
How to determine the chunks sizes for the output array. By default, the output array will have unknown chunk lengths along the first axis, which can cause some later operations to fail.
- True : immediately compute the length of each partition
- Sequence : a sequence of integers to use for the chunk sizes on the first axis. These values are not validated for correctness, beyond ensuring that the number of items matches the number of partitions.
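A brief hedged sketch (df is assumed to be an existing dask DataFrame):
>>> arr = df.to_dask_array(lengths=True) # compute partition lengths so chunk sizes are known # doctest: +SKIP
>>> arr.chunks # first-axis chunk sizes are now concrete integers # doctest: +SKIP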
-
to_delayed
(optimize_graph=True)¶ Convert into a list of dask.delayed objects, one per partition.
Parameters: optimize_graph : bool, optional
If True [default], the graph is optimized before converting into dask.delayed objects.
See also
Examples
>>> partitions = df.to_delayed() # doctest: +SKIP
-
to_frame
(name=None)¶ Convert Series to DataFrame
Parameters: name : object, default None
The passed name should substitute for the series name (if it has one).
Returns: data_frame : DataFrame
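For example (a hedged sketch; s is an existing dask Series and the column name is illustrative):
>>> ddf = s.to_frame(name='value') # doctest: +SKIP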
-
to_hdf
(path_or_buf, key, mode='a', append=False, **kwargs)¶ Store Dask Dataframe to Hierarchical Data Format (HDF) files
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0, or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
Parameters: path : string
Path to a target filename. May contain a * to denote many filenames
key : string
Datapath within the files. May contain a * to denote many locations
name_function : function
A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string (see examples below).
compute : bool
Whether or not to execute immediately. If False then this returns a dask.Delayed value.
lock : Lock, optional
Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
scheduler : string
The scheduler to use, like “threads” or “processes”
**other:
See pandas.to_hdf for more information
Returns: filenames : list
Returned if compute is True. List of file names that each partition is saved to.
delayed : dask.Delayed
Returned if compute is False. Delayed object to execute to_hdf when computed.
See also
Examples
Save Data to a single file
>>> df.to_hdf('output.hdf', '/data') # doctest: +SKIP
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*') # doctest: +SKIP
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data') # doctest: +SKIP
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') # doctest: +SKIP
Specify a custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc.
>>> from datetime import date, timedelta >>> base = date(year=2000, month=1, day=1) >>> def name_function(i): ... ''' Convert integer 0 to n to a string ''' ... return base + timedelta(days=i)
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) # doctest: +SKIP
-
to_json
(filename, *args, **kwargs)¶ See dd.to_json docstring for more information
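A hedged sketch (the output globstring is illustrative; as with to_csv, one file per partition is expected to be written):
>>> df.to_json('export-*.json') # doctest: +SKIP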
-
to_string
(max_rows=5)¶ Render a string representation of the Series
Parameters: buf : StringIO-like, optional
buffer to write to
na_rep : string, optional
string representation of NAN to use, default ‘NaN’
float_format : one-parameter function, optional
formatter function to apply to columns’ elements if they are floats default None
header: boolean, default True
Add the Series header (index name)
index : bool, optional
Add index (row) labels, default True
length : boolean, default False
Add the Series length
dtype : boolean, default False
Add the Series dtype
name : boolean, default False
Add the Series name if not None
max_rows : int, optional
Maximum number of rows to show before truncating. If None, show all.
Returns: formatted : string (if not buffer passed)
-
to_timestamp
(freq=None, how='start', axis=0)¶ Cast to DatetimeIndex of timestamps, at beginning of period
Parameters: freq : string, default frequency of PeriodIndex
Desired frequency
how : {‘s’, ‘e’, ‘start’, ‘end’}
Convention for converting period to timestamp; start of period vs. end
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
The axis to convert (the index by default)
copy : boolean, default True
If false then underlying input data is not copied
Returns: df : DataFrame with DatetimeIndex
-
truediv
(other, level=None, fill_value=None, axis=0)¶ Floating division of series and other, element-wise (binary operator truediv).
Equivalent to series / other, but with support to substitute a fill_value for missing data in one of the inputs.
Parameters: other : Series or scalar value
fill_value : None or float value, default None (NaN)
Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result will be missing
level : int or name
Broadcast across a level, matching Index values on the passed MultiIndex level
Returns: result : Series
See also
Examples
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd']) >>> a a 1.0 b 1.0 c 1.0 d NaN dtype: float64 >>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e']) >>> b a 1.0 b NaN d 1.0 e NaN dtype: float64 >>> a.add(b, fill_value=0) a 2.0 b 1.0 c 1.0 d 1.0 e NaN dtype: float64
-
unique
(split_every=None, split_out=1)¶ Return Series of unique values in the object. Includes NA values.
Returns: uniques : Series
-
value_counts
(split_every=None, split_out=1)¶ Returns object containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
Parameters: normalize : boolean, default False
If True then the object returned will contain the relative frequencies of the unique values.
sort : boolean, default True
Sort by values
ascending : boolean, default False
Sort in ascending order
bins : integer, optional
Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data
dropna : boolean, default True
Don’t include counts of NaN.
Returns: counts : Series
-
values
¶ Return a dask.array of the values of this dataframe
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
-
var
(axis=None, skipna=True, ddof=1, split_every=False, dtype=None, out=None)¶ Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument
Parameters: axis : {index (0), columns (1)}
skipna : boolean, default True
Exclude NA/null values. If an entire row/column is NA, the result will be NA
level : int or level name, default None
If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only : boolean, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns: var : Series or DataFrame (if level specified)
-
visualize
(filename='mydask', format=None, optimize_graph=False, **kwargs)¶ Render the computation of this object’s task graph using graphviz.
Requires graphviz to be installed.
Parameters: filename : str or None, optional
The name (without an extension) of the file to write to disk. If filename is None, no file will be written, and we communicate with dot using only pipes.
format : {‘png’, ‘pdf’, ‘dot’, ‘svg’, ‘jpeg’, ‘jpg’}, optional
Format in which to write output file. Default is ‘png’.
optimize_graph : bool, optional
If True, the graph is optimized before rendering. Otherwise, the graph is displayed as is. Default is False.
color: {None, ‘order’}, optional
Options to color nodes. Provide the cmap= keyword for an additional colormap
**kwargs
Additional keyword arguments to forward to to_graphviz.
Returns: result : IPython.display.Image, IPython.display.SVG, or None
See dask.dot.dot_graph for more information.
See also
dask.base.visualize, dask.dot.dot_graph
Notes
For more information on optimization see here:
https://docs.dask.org/en/latest/optimize.html
Examples
>>> x.visualize(filename='dask.pdf') # doctest: +SKIP >>> x.visualize(filename='dask.pdf', color='order') # doctest: +SKIP
-
where
(cond, other=nan)¶ Return an object of same shape as self and whose corresponding entries are from self where cond is True and otherwise are from other.
Parameters: cond : boolean NDFrame, array-like, or callable
Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the NDFrame and should return boolean NDFrame or array. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as cond.
other : scalar, NDFrame, or callable
Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the NDFrame and should return scalar or NDFrame. The callable must not change input NDFrame (though pandas doesn’t check it).
New in version 0.18.1: A callable can be used as other.
inplace : boolean, default False
Whether to perform the operation in place on the data
axis : alignment axis if needed, default None
level : alignment level if needed, default None
errors : str, {‘raise’, ‘ignore’}, default ‘raise’
raise : allow exceptions to be raised
ignore : suppress exceptions. On error return original object
Note that currently this parameter won’t affect the results and will always coerce to a suitable dtype.
try_cast : boolean, default False
try to cast the result back to the input type (if possible),
raise_on_error : boolean, default True
Whether to raise on invalid data types (e.g. trying to where on strings)
Deprecated since version 0.21.0.
Returns: wh : same type as caller
See also
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5)) # doctest: +SKIP >>> s.where(s > 0) # doctest: +SKIP 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0
>>> s.mask(s > 0) # doctest: +SKIP 0 0.0 1 NaN 2 NaN 3 NaN 4 NaN
>>> s.where(s > 1, 10) # doctest: +SKIP 0 10.0 1 10.0 2 2.0 3 3.0 4 4.0
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B']) # doctest: +SKIP >>> m = df % 3 == 0 # doctest: +SKIP >>> df.where(m, -df) # doctest: +SKIP A B 0 0 -1 1 -2 3 2 -4 -5 3 6 -7 4 -8 9 >>> df.where(m, -df) == np.where(m, df, -df) # doctest: +SKIP A B 0 True True 1 True True 2 True True 3 True True 4 True True >>> df.where(m, -df) == df.mask(~m, -df) # doctest: +SKIP A B 0 True True 1 True True 2 True True 3 True True 4 True True
-
DataFrameGroupBy¶
-
class
dask.dataframe.groupby.
DataFrameGroupBy
(df, by=None, slice=None)¶ -
agg
(arg, split_every=None, split_out=1)¶ Aggregate using one or more operations over the specified axis.
Parameters: func : function, string, dictionary, or list of string/functions
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
Accepted combinations are:
- string function name.
- function.
- list of functions.
- dict of column names -> functions (or list of functions).
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: aggregated : DataFrame
See also
pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame({'A': [1, 1, 2, 2], # doctest: +SKIP ... 'B': [1, 2, 3, 4], ... 'C': np.random.randn(4)})
>>> df # doctest: +SKIP A B C 0 1 1 0.362838 1 1 2 0.227877 2 2 3 1.267767 3 2 4 -0.562860
The aggregation is for each column.
>>> df.groupby('A').agg('min') # doctest: +SKIP B C A 1 1 0.227877 2 3 -0.562860
Multiple aggregations
>>> df.groupby('A').agg(['min', 'max']) # doctest: +SKIP B C min max min max A 1 1 2 0.227877 0.362838 2 3 4 -0.562860 1.267767
Select a column for aggregation
>>> df.groupby('A').B.agg(['min', 'max']) # doctest: +SKIP min max A 1 1 2 2 3 4
Different aggregations per column
>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'}) # doctest: +SKIP B C min max sum A 1 1 2 0.590716 2 3 4 0.704907
-
aggregate
(arg, split_every=None, split_out=1)¶ Aggregate using one or more operations over the specified axis.
Parameters: func : function, string, dictionary, or list of string/functions
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
Accepted combinations are:
- string function name.
- function.
- list of functions.
- dict of column names -> functions (or list of functions).
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: aggregated : DataFrame
See also
pandas.DataFrame.groupby.apply, pandas.DataFrame.groupby.transform, pandas.DataFrame.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame({'A': [1, 1, 2, 2], # doctest: +SKIP ... 'B': [1, 2, 3, 4], ... 'C': np.random.randn(4)})
>>> df # doctest: +SKIP A B C 0 1 1 0.362838 1 1 2 0.227877 2 2 3 1.267767 3 2 4 -0.562860
The aggregation is for each column.
>>> df.groupby('A').agg('min') # doctest: +SKIP B C A 1 1 0.227877 2 3 -0.562860
Multiple aggregations
>>> df.groupby('A').agg(['min', 'max']) # doctest: +SKIP B C min max min max A 1 1 2 0.227877 0.362838 2 3 4 -0.562860 1.267767
Select a column for aggregation
>>> df.groupby('A').B.agg(['min', 'max']) # doctest: +SKIP min max A 1 1 2 2 3 4
Different aggregations per column
>>> df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'}) # doctest: +SKIP B C min max sum A 1 1 2 0.590716 2 3 4 0.704907
-
apply
(func, *args, **kwargs)¶ Parallel version of pandas GroupBy.apply
This mimics the pandas version except for the following:
- The user should provide output metadata.
- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters: func: function
Function to apply
args, kwargs : Scalar, Delayed or object
Arguments and keywords to pass to the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Returns: applied : Series or DataFrame depending on columns keyword
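A minimal hedged sketch of supplying meta in the dict form described above (the grouping column 'A', column 'B', and dtypes are illustrative):
>>> ddf.groupby('A').apply(lambda part: part.sum(), meta={'A': 'int64', 'B': 'float64'}) # doctest: +SKIP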
-
count
(split_every=None, split_out=1)¶ Compute count of group, excluding missing values
-
cumcount
(axis=None)¶ Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to
>>> self.apply(lambda x: Series(np.arange(len(x)), x.index)) # doctest: +SKIP
Parameters: ascending : bool, default True
If False, number in reverse, from length of group - 1 to 0.
See also
ngroup
- Number the groups themselves.
Examples
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], # doctest: +SKIP ... columns=['A']) >>> df # doctest: +SKIP A 0 a 1 a 2 a 3 b 4 b 5 a >>> df.groupby('A').cumcount() # doctest: +SKIP 0 0 1 1 2 2 3 0 4 1 5 3 dtype: int64 >>> df.groupby('A').cumcount(ascending=False) # doctest: +SKIP 0 3 1 2 2 1 3 1 4 0 5 0 dtype: int64
-
cumprod
(axis=0)¶ Cumulative product for each group
-
cumsum
(axis=0)¶ Cumulative sum for each group
-
first
(split_every=None, split_out=1)¶ Compute first of group values
-
get_group
(key)¶ Constructs NDFrame from group with provided name
Parameters: name : object
the name of the group to get as a DataFrame
obj : NDFrame, default None
the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group : type of obj
-
last
(split_every=None, split_out=1)¶ Compute last of group values
-
max
(split_every=None, split_out=1)¶ Compute max of group values
-
mean
(split_every=None, split_out=1)¶ Compute mean of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
-
min
(split_every=None, split_out=1)¶ Compute min of group values
-
size
(split_every=None, split_out=1)¶ Compute group sizes
-
std
(ddof=1, split_every=None, split_out=1)¶ Compute standard deviation of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
-
sum
(split_every=None, split_out=1)¶ Compute sum of group values
-
var
(ddof=1, split_every=None, split_out=1)¶ Compute variance of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
-
SeriesGroupBy¶
-
class
dask.dataframe.groupby.
SeriesGroupBy
(df, by=None, slice=None)¶ -
agg
(arg, split_every=None, split_out=1)¶ Aggregate using one or more operations over the specified axis.
Parameters: func : function, string, dictionary, or list of string/functions
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
Accepted combinations are:
- string function name.
- function.
- list of functions.
- dict of column names -> functions (or list of functions).
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: aggregated : Series
See also
pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = Series([1, 2, 3, 4]) # doctest: +SKIP
>>> s # doctest: +SKIP 0 1 1 2 2 3 3 4 dtype: int64
>>> s.groupby([1, 1, 2, 2]).min() # doctest: +SKIP 1 1 2 3 dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg('min') # doctest: +SKIP 1 1 2 3 dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max']) # doctest: +SKIP min max 1 1 2 2 3 4
-
aggregate
(arg, split_every=None, split_out=1)¶ Aggregate using one or more operations over the specified axis.
Parameters: func : function, string, dictionary, or list of string/functions
Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
Accepted combinations are:
- string function name.
- function.
- list of functions.
- dict of column names -> functions (or list of functions).
*args
Positional arguments to pass to func.
**kwargs
Keyword arguments to pass to func.
Returns: aggregated : Series
See also
pandas.Series.groupby.apply, pandas.Series.groupby.transform, pandas.Series.aggregate
Notes
agg is an alias for aggregate. Use the alias.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> s = Series([1, 2, 3, 4]) # doctest: +SKIP
>>> s # doctest: +SKIP 0 1 1 2 2 3 3 4 dtype: int64
>>> s.groupby([1, 1, 2, 2]).min() # doctest: +SKIP 1 1 2 3 dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg('min') # doctest: +SKIP 1 1 2 3 dtype: int64
>>> s.groupby([1, 1, 2, 2]).agg(['min', 'max']) # doctest: +SKIP min max 1 1 2 2 3 4
-
apply
(func, *args, **kwargs)¶ Parallel version of pandas GroupBy.apply
This mimics the pandas version except for the following:
- The user should provide output metadata.
- If the grouper does not align with the index then this causes a full shuffle. The order of rows within each group may not be preserved.
Parameters: func: function
Function to apply
args, kwargs : Scalar, Delayed or object
Arguments and keywords to pass to the function.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
Returns: applied : Series or DataFrame depending on columns keyword
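A minimal hedged sketch for a Series result, using the (name, dtype) tuple form of meta (column names here are illustrative):
>>> ddf.groupby('A').B.apply(lambda part: part.max() - part.min(), meta=('B', 'int64')) # doctest: +SKIP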
-
count
(split_every=None, split_out=1)¶ Compute count of group, excluding missing values
-
cumcount
(axis=None)¶ Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to
>>> self.apply(lambda x: Series(np.arange(len(x)), x.index)) # doctest: +SKIP
Parameters: ascending : bool, default True
If False, number in reverse, from length of group - 1 to 0.
See also
ngroup
- Number the groups themselves.
Examples
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], # doctest: +SKIP ... columns=['A']) >>> df # doctest: +SKIP A 0 a 1 a 2 a 3 b 4 b 5 a >>> df.groupby('A').cumcount() # doctest: +SKIP 0 0 1 1 2 2 3 0 4 1 5 3 dtype: int64 >>> df.groupby('A').cumcount(ascending=False) # doctest: +SKIP 0 3 1 2 2 1 3 1 4 0 5 0 dtype: int64
-
cumprod
(axis=0)¶ Cumulative product for each group
-
cumsum
(axis=0)¶ Cumulative sum for each group
-
first
(split_every=None, split_out=1)¶ Compute first of group values
-
get_group
(key)¶ Constructs NDFrame from group with provided name
Parameters: name : object
the name of the group to get as a DataFrame
obj : NDFrame, default None
the NDFrame to take the DataFrame out of. If it is None, the object groupby was called on will be used
Returns: group : type of obj
-
last
(split_every=None, split_out=1)¶ Compute last of group values
-
max
(split_every=None, split_out=1)¶ Compute max of group values
-
mean
(split_every=None, split_out=1)¶ Compute mean of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
-
min
(split_every=None, split_out=1)¶ Compute min of group values
-
size
(split_every=None, split_out=1)¶ Compute group sizes
-
std
(ddof=1, split_every=None, split_out=1)¶ Compute standard deviation of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
-
sum
(split_every=None, split_out=1)¶ Compute sum of group values
-
var
(ddof=1, split_every=None, split_out=1)¶ Compute variance of groups, excluding missing values
For multiple groupings, the result index will be a MultiIndex
Parameters: ddof : integer, default 1
degrees of freedom
-
Storage and Conversion¶
-
dask.dataframe.
read_csv
(urlpath, blocksize=64000000, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)¶ Read CSV files into a Dask.DataFrame
This parallelizes the pandas.read_csv() function in the following ways:
It supports loading many files at once using globstrings:
>>> df = dd.read_csv('myfiles.*.csv') # doctest: +SKIP
In some cases it can break up large files:
>>> df = dd.read_csv('largefile.csv', blocksize=25e6) # 25MB chunks # doctest: +SKIP
It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:
>>> df = dd.read_csv('s3://bucket/myfiles.*.csv') # doctest: +SKIP >>> df = dd.read_csv('hdfs:///myfiles.*.csv') # doctest: +SKIP >>> df = dd.read_csv('hdfs://namenode.example.com/myfiles.*.csv') # doctest: +SKIP
Internally dd.read_csv uses pandas.read_csv() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_csv() for more information on available keyword arguments.
Parameters: urlpath : string or list
Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
blocksize : int or None, optional
Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.
collection : boolean, optional
Return a dask.dataframe if True or list of dask.delayed objects if False
sample : int, optional
Number of bytes to use when determining dtypes
assume_missing : bool, optional
If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
include_path_column : bool or str, optional
Whether or not to include the path to each particular file. If True a new column is added to the dataframe called path. If str, sets the new column name. Default is False.
**kwargs
Extra keyword arguments to forward to pandas.read_csv().
Notes
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:
- Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
- Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
- Increase the size of the sample using the sample keyword.
It should also be noted that this function may fail if a CSV file includes quoted strings that contain the line terminator. To get around this you can specify
blocksize=None
to not split files into multiple partitions, at the cost of reduced parallelism.
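For instance, a hedged sketch of the first two options above (the column name 'amount' is purely illustrative):
>>> df = dd.read_csv('myfiles.*.csv', dtype={'amount': 'float64'}) # doctest: +SKIP
>>> df = dd.read_csv('myfiles.*.csv', assume_missing=True) # doctest: +SKIP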
-
dask.dataframe.
read_table
(urlpath, blocksize=64000000, collection=True, lineterminator=None, compression=None, sample=256000, enforce=False, assume_missing=False, storage_options=None, include_path_column=False, **kwargs)¶ Read delimited files into a Dask.DataFrame
This parallelizes the pandas.read_table() function in the following ways:
It supports loading many files at once using globstrings:
>>> df = dd.read_table('myfiles.*.csv') # doctest: +SKIP
In some cases it can break up large files:
>>> df = dd.read_table('largefile.csv', blocksize=25e6) # 25MB chunks # doctest: +SKIP
It can read CSV files from external resources (e.g. S3, HDFS) by providing a URL:
>>> df = dd.read_table('s3://bucket/myfiles.*.csv') # doctest: +SKIP >>> df = dd.read_table('hdfs:///myfiles.*.csv') # doctest: +SKIP >>> df = dd.read_table('hdfs://namenode.example.com/myfiles.*.csv') # doctest: +SKIP
Internally dd.read_table uses pandas.read_table() and supports many of the same keyword arguments with the same performance guarantees. See the docstring for pandas.read_table() for more information on available keyword arguments.
Parameters: urlpath : string or list
Absolute or relative filepath(s). Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol.
blocksize : int or None, optional
Number of bytes by which to cut up larger files. Default value is computed based on available physical memory and the number of cores. If None, use a single block for each file.
collection : boolean, optional
Return a dask.dataframe if True or list of dask.delayed objects if False
sample : int, optional
Number of bytes to use when determining dtypes
assume_missing : bool, optional
If True, all integer columns that aren’t specified in dtype are assumed to contain missing values, and are converted to floats. Default is False.
storage_options : dict, optional
Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc.
include_path_column : bool or str, optional
Whether or not to include the path to each particular file. If True a new column is added to the dataframe called path. If str, sets the new column name. Default is False.
**kwargs
Extra keyword arguments to forward to pandas.read_table().
Notes
Dask dataframe tries to infer the dtype of each column by reading a sample from the start of the file (or of the first file if it’s a glob). Usually this works fine, but if the dtype is different later in the file (or in other files) this can cause issues. For example, if all the rows in the sample had integer dtypes, but later on there was a NaN, then this would error at compute time. To fix this, you have a few options:
- Provide explicit dtypes for the offending columns using the dtype keyword. This is the recommended solution.
- Use the assume_missing keyword to assume that all columns inferred as integers contain missing values, and convert them to floats.
- Increase the size of the sample using the sample keyword.
It should also be noted that this function may fail if a delimited file includes quoted strings that contain the line terminator. To get around this you can specify
blocksize=None
to not split files into multiple partitions, at the cost of reduced parallelism.
-
dask.dataframe.
read_parquet
(path, columns=None, filters=None, categories=None, index=None, storage_options=None, engine='auto', infer_divisions=None)¶ Read ParquetFile into a Dask DataFrame
This reads a directory of Parquet data into a Dask.dataframe, one file per partition. It selects the index among the sorted columns if any exist.
Parameters: path : string, list or fastparquet.ParquetFile
Source directory for data, or path(s) to individual parquet files. Prefix with a protocol like s3:// to read from alternative filesystems. To read from multiple files you can pass a globstring or a list of paths, with the caveat that they must all have the same protocol. Alternatively, also accepts a previously opened fastparquet.ParquetFile()
columns : string, list or None (default)
Field name(s) to read in as columns in the output. By default all non-index fields will be read (as determined by the pandas parquet metadata, if present). Provide a single field name instead of a list to read in the data as a Series.
filters : list
List of filters to apply, like [('x', '>', 0), ...]. This implements row-group (partition)-level filtering only, i.e., to prevent the loading of some chunks of the data, and only if relevant statistics have been included in the metadata.
index : string, list, False or None (default)
Field name(s) to use as the output frame index. By default will be inferred from the pandas parquet file metadata (if present). Use False to read all fields as columns.
categories : list, dict or None
For any fields listed here, if the parquet encoding is Dictionary, the column will be created with dtype category. Use only if it is guaranteed that the column is encoded as dictionary in all row-groups. If a list, assumes up to 2**16-1 labels; if a dict, specify the number of labels expected; if None, will load categories automatically for data written by dask/fastparquet, not otherwise.
storage_options : dict
Key/value pairs to be passed on to the file-system backend, if any.
engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’
Parquet reader library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’
infer_divisions : bool or None (default).
By default, divisions are inferred if the read engine supports doing so efficiently and the index of the underlying dataset is sorted across the individual parquet files. Set to True to force divisions to be inferred in all cases. Note that this may require reading metadata from each file in the dataset, which may be expensive. Set to False to never infer divisions.
See also
Examples
>>> df = dd.read_parquet('s3://bucket/my-parquet-data') # doctest: +SKIP
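A hedged sketch of restricting what is read (the column names and engine choice are illustrative, not prescriptive):
>>> df = dd.read_parquet('s3://bucket/my-parquet-data', columns=['name', 'amount'], engine='fastparquet') # doctest: +SKIP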
-
dask.dataframe.
read_orc
(path, columns=None, storage_options=None)¶ Read dataframe from ORC file(s)
Parameters: path: str or list(str)
Location of file(s), which can be a full URL with protocol specifier, and may include glob character if a single string.
columns: None or list(str)
Columns to load. If None, loads all.
storage_options: None or dict
Further parameters to pass to the bytes backend.
Returns: Dask.DataFrame (even if there is only one column)
Examples
>>> df = dd.read_orc('https://github.com/apache/orc/raw/' ... 'master/examples/demo-11-zlib.orc') # doctest: +SKIP
-
dask.dataframe.
read_hdf
(pattern, key, start=0, stop=None, columns=None, chunksize=1000000, sorted_index=False, lock=True, mode='a')¶ Read HDF files into a Dask DataFrame
Read hdf files into a dask dataframe. This function is like pandas.read_hdf, except it can read from a single large file, or from multiple files, or from multiple keys from the same file.
Parameters: pattern : string, list
File pattern (string), buffer to read from, or list of file paths. Can contain wildcards.
key : group identifier in the store. Can contain wildcards
start : optional, integer (defaults to 0), row number to start at
stop : optional, integer (defaults to None, the last row), row number to stop at
columns : list of columns, optional
A list of columns that if not None, will limit the return columns (default is None)
chunksize : positive integer, optional
Maximal number of rows per partition (default is 1000000).
sorted_index : boolean, optional
Option to specify whether or not the input hdf files have a sorted index (default is False).
lock : boolean, optional
Option to use a lock to prevent concurrency issues (default is True).
mode : {‘a’, ‘r’, ‘r+’}, default ‘a’. Mode to use when opening file(s).
- ‘r’
Read-only; no data can be modified.
- ‘a’
Append; an existing file is opened for reading and writing, and if the file does not exist it is created.
- ‘r+’
It is similar to ‘a’, but the file must already exist.
Returns: dask.DataFrame
Examples
Load single file
>>> dd.read_hdf('myfile.1.hdf5', '/x') # doctest: +SKIP
Load multiple files
>>> dd.read_hdf('myfile.*.hdf5', '/x') # doctest: +SKIP
>>> dd.read_hdf(['myfile.1.hdf5', 'myfile.2.hdf5'], '/x') # doctest: +SKIP
Load multiple datasets
>>> dd.read_hdf('myfile.1.hdf5', '/*') # doctest: +SKIP
-
dask.dataframe.
read_json
(url_path, orient='records', lines=None, storage_options=None, blocksize=None, sample=1048576, encoding='utf-8', errors='strict', compression='infer', **kwargs)¶ Create a dataframe from a set of JSON files
This utilises pandas.read_json(), and most parameters are passed through - see its docstring.
Differences: orient is ‘records’ by default, with lines=True; this is appropriate for line-delimited “JSON-lines” data, the kind of JSON output that is most common in big-data scenarios, and which can be chunked when reading (see read_json()). All other options require blocksize=None, i.e., one partition per input file.
Parameters: url_path: str, list of str
Location to read from. If a string, can include a glob character to find a set of file names. Supports protocol specifications such as "s3://".
encoding, errors:
The text encoding to implement, e.g., “utf-8”, and how to respond to errors in the conversion (see str.encode()).
orient, lines, kwargs
passed to pandas; if not specified, lines=True when orient=’records’, False otherwise.
storage_options: dict
Passed to backend file-system implementation
blocksize: None or int
If None, files are not blocked, and you get one partition per input file. If int, which can only be used for line-delimited JSON files, each partition will be approximately this size in bytes, to the nearest newline character.
sample: int
Number of bytes to pre-load, to provide an empty dataframe structure to any blocks without data. Only relevant if using blocksize.
encoding, errors:
Text conversion,
see bytes.decode()
compression : string or None
String like ‘gzip’ or ‘xz’.
Returns: dask.DataFrame
Examples
Load single file
>>> dd.read_json('myfile.1.json') # doctest: +SKIP
Load multiple files
>>> dd.read_json('myfile.*.json') # doctest: +SKIP
>>> dd.read_json(['myfile.1.json', 'myfile.2.json']) # doctest: +SKIP
Load large line-delimited JSON files using partitions of approx 256MB size
>>> dd.read_json('data/file*.json', blocksize=2**28) # doctest: +SKIP
-
dask.dataframe.
read_sql_table
(table, uri, index_col, divisions=None, npartitions=None, limits=None, columns=None, bytes_per_chunk=268435456, head_rows=5, schema=None, meta=None, engine_kwargs=None, **kwargs)¶ Create dataframe from an SQL table.
If neither divisions nor npartitions is given, the memory footprint of the first few rows will be determined, and partitions of size ~256MB will be used.
Parameters: table : string or sqlalchemy expression
Select columns from here.
uri : string
Full sqlalchemy URI for the database connection
index_col : string
Column which becomes the index, and defines the partitioning. Should be an indexed column in the SQL server, and any orderable type. If the type is number or time, then partition boundaries can be inferred from npartitions or bytes_per_chunk; otherwise must supply explicit
divisions=
.index_col
could be a function to return a value, e.g.,sql.func.abs(sql.column('value')).label('abs(value)')
. Labeling columns created by functions or arithmetic operations is required.divisions: sequence
Values of the index column to split the table by. If given, this will override npartitions and bytes_per_chunk. The divisions are the value boundaries of the index column used to define the partitions. For example,
divisions=list('acegikmoqsuwz')
could be used to partition a string column lexographically into 12 partitions, with the implicit assumption that each partition contains similar numbers of records.npartitions : int
Number of partitions, if divisions is not given. Will split the values of the index column linearly between limits, if given, or the column max/min. The index column must be numeric or time for this to work
limits: 2-tuple or None
Manually give upper and lower range of values for use with npartitions; if None, first fetches max/min from the DB. Upper limit, if given, is inclusive.
columns : list of strings or None
Which columns to select; if None, gets all; can include sqlalchemy functions, e.g.,
sql.func.abs(sql.column('value')).label('abs(value)')
. Labeling columns created by functions or arithmetic operations is recommended.
bytes_per_chunk : int
If both divisions and npartitions are None, this is the target size of each partition, in bytes
head_rows : int
How many rows to load for inferring the data-types, unless passing meta
meta : empty DataFrame or None
If provided, do not attempt to infer dtypes, but use these, coercing all chunks on load
schema : str or None
If using a table name, pass this to sqlalchemy to select which DB schema to use within the URI connection
engine_kwargs : dict or None
Specific db engine parameters for sqlalchemy
kwargs : dict
Additional parameters to pass to pd.read_sql()
Returns: dask.DataFrame
Examples
>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                        npartitions=10, index_col='id')  # doctest: +SKIP
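As an additional hedged sketch (the table name, URI and boundary values below are placeholders), explicit divisions can be supplied instead of letting dask infer partition boundaries from bytes_per_chunk:
>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
...                        index_col='id',
...                        divisions=[0, 10000, 20000, 30000])  # doctest: +SKIP
Here the three partitions are bounded by the given id values, so no sampling of the table is needed to decide where partitions start and end.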
-
dask.dataframe.
from_array
(x, chunksize=50000, columns=None)¶ Read any sliceable array into a Dask DataFrame
Uses getitem syntax to pull slices out of the array. The array need not be a NumPy array but must support slicing syntax x[50000:100000] and either have 2 dimensions (x.ndim == 2) or have a record dtype (x.dtype == [('name', 'O'), ('balance', 'i8')]).
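A brief illustrative sketch (the column names are chosen here for illustration only): both a plain 2-d NumPy array and a record-dtype array can be read.
>>> import numpy as np
>>> import dask.dataframe as dd
>>> x = np.random.random((100000, 2))
>>> df = dd.from_array(x, chunksize=25000, columns=['x', 'y'])   # 4 partitions
>>> r = np.zeros(4, dtype=[('name', 'O'), ('balance', 'i8')])
>>> df2 = dd.from_array(r)   # column names taken from the record dtype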
-
dask.dataframe.
from_pandas
(data, npartitions=None, chunksize=None, sort=True, name=None)¶ Construct a Dask DataFrame from a Pandas DataFrame
This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel.
Note that, despite parallelism, Dask.dataframe may not always be faster than Pandas. We recommend that you stay with Pandas for as long as possible before switching to Dask.dataframe.
Parameters: df : pandas.DataFrame or pandas.Series
The DataFrame/Series with which to construct a Dask DataFrame/Series
npartitions : int, optional
The number of partitions of the index to create. Note that depending on the size and index of the dataframe, the output may have fewer partitions than requested.
chunksize : int, optional
The number of rows per index partition to use.
sort: bool
If True (the default), sort the input by index first to obtain cleanly divided partitions; if False, the input is not sorted and partitions will not be cleanly divided.
name: string, optional
An optional keyname for the dataframe. Defaults to hashing the input
Returns: dask.DataFrame or dask.Series
A dask DataFrame/Series partitioned along the index
Raises: TypeError
If something other than a
pandas.DataFrame
or pandas.Series is passed in.
See also
from_array
- Construct a dask.DataFrame from an array that has record dtype
read_csv
- Construct a dask.DataFrame from a CSV file
Examples
>>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
...                    index=pd.date_range(start='20100101', periods=6))
>>> ddf = from_pandas(df, npartitions=3)
>>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
>>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
>>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
(Timestamp('2010-01-01 00:00:00', freq='D'),
 Timestamp('2010-01-03 00:00:00', freq='D'),
 Timestamp('2010-01-05 00:00:00', freq='D'),
 Timestamp('2010-01-06 00:00:00', freq='D'))
-
dask.dataframe.
from_bcolz
(x, chunksize=None, categorize=True, index=None, lock=<unlocked _thread.lock object>, **kwargs)¶ Read BColz CTable into a Dask Dataframe
BColz is a fast on-disk compressed column store with careful attention given to compression. https://bcolz.readthedocs.io/en/latest/
Parameters: x : bcolz.ctable
chunksize : int, optional
The size (in rows) of blocks to pull out from the ctable.
categorize : bool, defaults to True
Automatically categorize all string dtypes
index : string, optional
Column to make the index
lock: bool or Lock
Lock to use when reading, or False for no lock (not thread-safe)
See also
from_array
- more generic function not optimized for bcolz
-
dask.dataframe.
from_dask_array
(x, columns=None, index=None)¶ Create a Dask DataFrame from a Dask Array.
Converts a 2d array into a DataFrame and a 1d array into a Series.
Parameters: x : da.Array
columns : list or string
list of column names if DataFrame, single string if Series
index : dask.dataframe.Index, optional
An optional dask Index to use for the output Series or DataFrame.
The default output index depends on whether x has any unknown chunks. If there are any unknown chunks, the output has
None
for all the divisions (one per chunk). If all the chunks are known, a default index with known divisions is created.
Specifying index can be useful if you’re conforming a Dask Array to an existing dask Series or DataFrame, and you would like the indices to match.
See also
dask.bag.to_dataframe
- from dask.bag
dask.dataframe._Frame.values
- Reverse conversion
dask.dataframe._Frame.to_records
- Reverse conversion
Examples
>>> import dask.array as da
>>> import dask.dataframe as dd
>>> x = da.ones((4, 2), chunks=(2, 2))
>>> df = dd.io.from_dask_array(x, columns=['a', 'b'])
>>> df.compute()
     a    b
0  1.0  1.0
1  1.0  1.0
2  1.0  1.0
3  1.0  1.0
-
dask.dataframe.
from_delayed
(dfs, meta=None, divisions=None, prefix='from-delayed')¶ Create Dask DataFrame from many Dask Delayed objects
Parameters: dfs : list of Delayed
An iterable of dask.delayed.Delayed objects, such as those produced by dask.delayed. These comprise the individual partitions of the resulting dataframe.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
divisions : tuple, str, optional
Partition boundaries along the index. For a tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions. For the string ‘sorted’, the delayed values will be computed to find index values; this assumes that the indexes are mutually sorted. If None, index information will not be used.
prefix : str, optional
Prefix to prepend to the keys.
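For illustration, a minimal sketch (the load function and its columns are hypothetical stand-ins for whatever produces each partition) where every delayed value builds one partition and meta describes the expected schema:
>>> import pandas as pd
>>> import dask
>>> import dask.dataframe as dd
>>> @dask.delayed
... def load(i):
...     # stand-in for reading one file or one query result
...     return pd.DataFrame({'a': [i, i + 1], 'b': [0.1 * i, 0.1 * i + 1]})
>>> parts = [load(i) for i in range(3)]
>>> meta = pd.DataFrame({'a': pd.Series(dtype='int64'),
...                      'b': pd.Series(dtype='float64')})
>>> ddf = dd.from_delayed(parts, meta=meta)
>>> ddf.npartitions
3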
-
dask.dataframe.
to_records
(df)¶ Create Dask Array from a Dask Dataframe
Warning: This creates a dask.array without precise shape information. Operations that depend on shape information, like slicing or reshaping, will not work.
See also
dask.dataframe._Frame.values, dask.dataframe.from_dask_array
Examples
>>> df.to_records()  # doctest: +SKIP
dask.array<shape=(nan,), dtype=(numpy.record, [('ind', '<f8'), ('x', 'O'), ('y', '<i8')]), chunksize=(nan,)>
-
dask.dataframe.
to_csv
(df, filename, name_function=None, compression=None, compute=True, scheduler=None, storage_options=None, header_first_partition_only=False, **kwargs)¶ Store Dask DataFrame to CSV files
One filename per partition will be created. You can specify the filenames in a variety of ways.
Use a globstring:
>>> df.to_csv('/path/to/data/export-*.csv')
The * will be replaced by the increasing sequence 0, 1, 2, …
/path/to/data/export-0.csv
/path/to/data/export-1.csv
Use a globstring and a
name_function=
keyword argument. The name_function function should expect an integer and produce a string. Strings produced by name_function must preserve the order of their respective partition indices.
>>> from datetime import date, timedelta
>>> def name(i):
...     return str(date(2015, 1, 1) + i * timedelta(days=1))
>>> name(0)
'2015-01-01'
>>> name(15)
'2015-01-16'
>>> df.to_csv('/path/to/data/export-*.csv', name_function=name) # doctest: +SKIP
/path/to/data/export-2015-01-01.csv
/path/to/data/export-2015-01-02.csv
...
You can also provide an explicit list of paths:
>>> paths = ['/path/to/data/alice.csv', '/path/to/data/bob.csv', ...]
>>> df.to_csv(paths)
Parameters: filename : string
Path glob indicating the naming scheme for the output files
name_function : callable, default None
Function accepting an integer (partition index) and producing a string to replace the asterisk in the given filename globstring. Should preserve the lexicographic order of partitions
compression : string or None
String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
sep : character, default ‘,’
Field delimiter for the output file
na_rep : string, default ‘’
Missing data representation
float_format : string, default None
Format string for floating point numbers
columns : sequence, optional
Columns to write
header : boolean or list of string, default True
Write out column names. If a list of strings is given it is assumed to be aliases for the column names
header_first_partition_only : boolean, default False
If set, only write the header row in the first output file
index : boolean, default True
Write row names (index)
index_label : string or sequence, or False, default None
Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R
nanRep : None
deprecated, use na_rep
mode : str
Python write mode, default ‘w’
encoding : string, optional
A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
line_terminator : string, default ‘\n’
The newline character or character sequence to use in the output file
quoting : optional constant from csv module
defaults to csv.QUOTE_MINIMAL
quotechar : string (length 1), default ‘”’
character used to quote fields
doublequote : boolean, default True
Control quoting of quotechar inside a field
escapechar : string (length 1), default None
character used to escape sep and quotechar when appropriate
chunksize : int or None
rows to write at a time
tupleize_cols : boolean, default False
write MultiIndex columns as a list of tuples (if True) or in the new, expanded format (if False)
date_format : string, default None
Format string for datetime objects
decimal: string, default ‘.’
Character recognized as decimal separator. E.g. use ‘,’ for European data
storage_options: dict
Parameters passed on to the backend filesystem class.
Returns: The names of the files written, if they were computed right away.
If not, the delayed tasks associated with writing the files.
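For example (a hedged sketch; the paths are placeholders and df is assumed to be a dask DataFrame), with compute=False the per-partition writes come back as delayed values that can be executed together later:
>>> import dask
>>> writes = df.to_csv('/path/to/data/export-*.csv', compute=False)  # doctest: +SKIP
>>> dask.compute(*writes)  # doctest: +SKIP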
-
dask.dataframe.
to_bag
(df, index=False)¶ Create Dask Bag from a Dask DataFrame
Parameters: index : bool, optional
If True, the elements are tuples of (index, value), otherwise they’re just the value. Default is False.
Examples
>>> bag = df.to_bag() # doctest: +SKIP
-
dask.dataframe.
to_hdf
(df, path, key, mode='a', append=False, scheduler=None, name_function=None, compute=True, lock=None, dask_kwargs={}, **kwargs)¶ Store Dask Dataframe to Hierarchical Data Format (HDF) files
This is a parallel version of the Pandas function of the same name. Please see the Pandas docstring for more detailed information about shared keyword arguments.
This function differs from the Pandas version by saving the many partitions of a Dask DataFrame in parallel, either to many files, or to many datasets within the same file. You may specify this parallelism with an asterisk * within the filename or datapath, and an optional name_function. The asterisk will be replaced with an increasing sequence of integers starting from 0, or with the result of calling name_function on each of those integers.
This function only supports the Pandas 'table' format, not the more specialized 'fixed' format.
Parameters: path : string
Path to a target filename. May contain a * to denote many filenames.
key : string
Datapath within the files. May contain a * to denote many locations.
name_function : function
A function to convert the * in the above options to a string. Should take in a number from 0 to the number of partitions and return a string. (See examples below.)
compute : bool
Whether or not to execute immediately. If False then this returns a dask.Delayed value.
lock : Lock, optional
Lock to use to prevent concurrency issues. By default a threading.Lock, multiprocessing.Lock or SerializableLock will be used depending on your scheduler if a lock is required. See dask.utils.get_scheduler_lock for more information about lock selection.
scheduler : string
The scheduler to use, like “threads” or “processes”
**other:
See pandas.to_hdf for more information
Returns: filenames : list
Returned if compute is True. List of file names that each partition is saved to.
delayed : dask.Delayed
Returned if compute is False. Delayed object to execute to_hdf when computed.
See also
Examples
Save Data to a single file
>>> df.to_hdf('output.hdf', '/data') # doctest: +SKIP
Save data to multiple datapaths within the same file:
>>> df.to_hdf('output.hdf', '/data-*') # doctest: +SKIP
Save data to multiple files:
>>> df.to_hdf('output-*.hdf', '/data') # doctest: +SKIP
Save data to multiple files, using the multiprocessing scheduler:
>>> df.to_hdf('output-*.hdf', '/data', scheduler='processes') # doctest: +SKIP
Specify custom naming scheme. This writes files as ‘2000-01-01.hdf’, ‘2000-01-02.hdf’, ‘2000-01-03.hdf’, etc..
>>> from datetime import date, timedelta
>>> base = date(year=2000, month=1, day=1)
>>> def name_function(i):
...     ''' Convert integer 0 to n to a string '''
...     return str(base + timedelta(days=i))
>>> df.to_hdf('*.hdf', '/data', name_function=name_function) # doctest: +SKIP
-
dask.dataframe.
to_parquet
(df, path, engine='auto', compression='default', write_index=None, append=False, ignore_divisions=False, partition_on=None, storage_options=None, compute=True, **kwargs)¶ Store Dask.dataframe to Parquet files
Parameters: df : dask.dataframe.DataFrame
path : string
Destination directory for data. Prepend with protocol like s3:// or hdfs:// for remote data.
engine : {‘auto’, ‘fastparquet’, ‘pyarrow’}, default ‘auto’
Parquet library to use. If only one library is installed, it will use that one; if both, it will use ‘fastparquet’.
compression : string or dict, optional
Either a string like "snappy" or a dictionary mapping column names to compressors like {"name": "gzip", "values": "snappy"}. The default is "default", which uses the default compression for whichever engine is selected.
write_index : boolean, optional
Whether or not to write the index. Defaults to True if divisions are known.
append : bool, optional
If False (default), construct data-set from scratch. If True, add new row-group(s) to an existing data-set. In the latter case, the data-set must exist, and the schema must match the input data.
ignore_divisions : bool, optional
If False (default) raises error when previous divisions overlap with the new appended divisions. Ignored if append=False.
partition_on : list, optional
Construct directory-based partitioning by splitting on these fields’ values. Each dask partition will result in one or more datafiles; there will be no global groupby.
storage_options : dict, optional
Key/value pairs to be passed on to the file-system backend, if any.
compute : bool, optional
If True (default) then the result is computed immediately. If False then a dask.delayed object is returned for future computation.
**kwargs
Extra options to be passed on to the specific backend.
See also
read_parquet
- Read parquet data to dask.dataframe
Notes
Each partition will be written to a separate file.
Examples
>>> df = dd.read_csv(...)  # doctest: +SKIP
>>> dd.to_parquet(df, '/path/to/output/', compression='snappy')  # doctest: +SKIP
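A further hedged sketch (the output path and the ‘year’ column are placeholders, and a parquet engine such as fastparquet or pyarrow is assumed to be installed) showing directory-based partitioning with partition_on:
>>> dd.to_parquet(df, '/path/to/output/', partition_on=['year'])  # doctest: +SKIP
This writes one subdirectory per distinct value of ‘year’, with one or more data files per dask partition inside each.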
-
dask.dataframe.
to_json
(df, url_path, orient='records', lines=None, storage_options=None, compute=True, encoding='utf-8', errors='strict', compression=None, **kwargs)¶ Write dataframe into JSON text files
This utilises
pandas.DataFrame.to_json()
, and most parameters are passed through - see its docstring.
Differences: orient is ‘records’ by default, with lines=True; this produces the kind of JSON output that is most common in big-data applications, and which can be chunked when reading (see read_json()).
Parameters: df: dask.DataFrame
Data to save
url_path: str, list of str
Location to write to. If a string, and there is more than one partition in df, it should include a glob character to expand into a set of file names, or you should provide a name_function= parameter. Supports protocol specifications such as "s3://".
encoding, errors:
The text encoding to implement, e.g., “utf-8” and how to respond to errors in the conversion (see str.encode()).
orient, lines, kwargs
passed to pandas; if not specified, lines=True when orient=’records’, False otherwise.
storage_options: dict
Passed to backend file-system implementation
compute: bool
If True, immediately executes. If False, returns a set of delayed objects, which can be computed at a later time.
encoding, errors:
Text conversion,
see str.encode()
compression : string or None
String like ‘gzip’ or ‘xz’.
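An illustrative sketch (the paths and the storage option are placeholders, and df is assumed to be a dask DataFrame): one line-delimited JSON file is written per partition, with * expanded to 0, 1, 2, …
>>> dd.to_json(df, 'data/records-*.json')  # doctest: +SKIP
>>> dd.to_json(df, 's3://bucket/records-*.json',
...            storage_options={'anon': False})  # doctest: +SKIP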
Rolling¶
-
dask.dataframe.rolling.
map_overlap
(func, df, before, after, *args, **kwargs)¶ Apply a function to each partition, sharing rows with adjacent partitions.
Parameters: func : function
Function applied to each partition.
df : dd.DataFrame, dd.Series
before : int or timedelta
The rows to prepend to partition i from the end of partition i - 1.
after : int or timedelta
The rows to append to partition i from the beginning of partition i + 1.
args, kwargs :
Arguments and keywords to pass to the function. The partition will be the first argument, and these will be passed after.
See also
dd.DataFrame.map_overlap
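As a small, hedged sketch of the idea (the data and partition count are chosen purely for illustration): by sharing one row with the previous partition, a shifted difference is computed correctly across partition boundaries:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> from dask.dataframe.rolling import map_overlap
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=3)
>>> result = map_overlap(lambda part: part.diff(1), ddf, 1, 0)
>>> result.compute().x.tolist()
[nan, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Without the one-row overlap, the first row of each partition after the first would have no preceding value and would come back as NaN.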
Other functions¶
-
dask.dataframe.
compute
(*args, **kwargs)¶ Compute several dask collections at once.
Parameters: args : object
Any number of objects. If it is a dask object, it’s computed and the result is returned. By default, python builtin collections are also traversed to look for dask objects (for more information see the traverse keyword). Non-dask arguments are passed through unchanged.
traverse : bool, optional
By default dask traverses builtin python collections looking for dask objects passed to compute. For large collections this can be expensive. If none of the arguments contain any dask objects, set traverse=False to avoid doing this traversal.
scheduler : string, optional
Which scheduler to use like “threads”, “synchronous” or “processes”. If not provided, the default is to check the global settings first, and then fall back to the collection defaults.
optimize_graph : bool, optional
If True [default], the optimizations for each collection are applied before computation. Otherwise the graph is run as is. This can be useful for debugging.
kwargs
Extra keywords to forward to the scheduler function.
Examples
>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)
By default, dask objects inside python collections will also be computed:
>>> compute({'a': a, 'b': b, 'c': 1})  # doctest: +SKIP
({'a': 45, 'b': 4.5, 'c': 1},)
-
dask.dataframe.
map_partitions
(func, *args, **kwargs)¶ Apply Python function on each DataFrame partition.
Parameters: func : function
Function applied to each partition.
args, kwargs :
Arguments and keywords to pass to the function. At least one of the args should be a Dask.dataframe. Arguments and keywords may contain Scalar, Delayed or regular python objects.
meta : pd.DataFrame, pd.Series, dict, iterable, tuple, optional
An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided. Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
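A minimal, hedged sketch (the column names are illustrative): the function receives a pandas DataFrame for each partition, and meta describes the resulting columns and dtypes:
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> ddf = dd.from_pandas(pd.DataFrame({'x': range(6), 'y': range(6)}),
...                      npartitions=2)
>>> result = dd.map_partitions(lambda part: part.assign(z=part.x + part.y),
...                            ddf,
...                            meta={'x': 'int64', 'y': 'int64', 'z': 'int64'})
>>> result.compute().z.tolist()
[0, 2, 4, 6, 8, 10]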
-
dask.dataframe.multi.
concat
(dfs, axis=0, join='outer', interleave_partitions=False)¶ Concatenate DataFrames along rows.
- When axis=0 (default), concatenate DataFrames row-wise:
- If all divisions are known and ordered, concatenate DataFrames keeping divisions. When divisions are not ordered, specifying interleave_partitions=True allows the divisions to be concatenated one by one.
- If any division is unknown, concatenate DataFrames resetting their divisions to unknown (None)
- When axis=1, concatenate DataFrames column-wise:
- Allowed if all divisions are known.
- If any division is unknown, a ValueError is raised.
Parameters: dfs : list
List of dask.DataFrames to be concatenated
axis : {0, 1, ‘index’, ‘columns’}, default 0
The axis to concatenate along
join : {‘inner’, ‘outer’}, default ‘outer’
How to handle indexes on other axis
interleave_partitions : bool, default False
Whether to concatenate DataFrames ignoring their order. If True, the divisions are concatenated one by one.
Notes
This differs from pd.concat when concatenating Categoricals with different categories. Pandas currently coerces those to objects before concatenating. Coercing to objects is very expensive for large arrays, so dask preserves the Categoricals by taking the union of the categories.
Examples
If all divisions are known and ordered, divisions are kept.
>>> a  # doctest: +SKIP
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b  # doctest: +SKIP
dd.DataFrame<y, divisions=(6, 8, 10)>
>>> dd.concat([a, b])  # doctest: +SKIP
dd.DataFrame<concat-..., divisions=(1, 3, 6, 8, 10)>
Unable to concatenate if divisions are not ordered.
>>> a  # doctest: +SKIP
dd.DataFrame<x, divisions=(1, 3, 5)>
>>> b  # doctest: +SKIP
dd.DataFrame<y, divisions=(2, 3, 6)>
>>> dd.concat([a, b])  # doctest: +SKIP
ValueError: All inputs have known divisions which cannot be concatenated
in order. Specify interleave_partitions=True to ignore order
Specify interleave_partitions=True to ignore the division order.
>>> dd.concat([a, b], interleave_partitions=True)  # doctest: +SKIP
dd.DataFrame<concat-..., divisions=(1, 2, 3, 5, 6)>
If any division is unknown, the resulting divisions will be unknown
>>> a  # doctest: +SKIP
dd.DataFrame<x, divisions=(None, None)>
>>> b  # doctest: +SKIP
dd.DataFrame<y, divisions=(1, 4, 10)>
>>> dd.concat([a, b])  # doctest: +SKIP
dd.DataFrame<concat-..., divisions=(None, None, None, None)>
Different categoricals are unioned
>>> dd.concat([  # doctest: +SKIP
...     dd.from_pandas(pd.Series(['a', 'b'], dtype='category'), 1),
...     dd.from_pandas(pd.Series(['a', 'c'], dtype='category'), 1),
... ], interleave_partitions=True).dtype
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
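To complement the skipped doctests above, a small hedged sketch that can be run directly (the index values are chosen so that the divisions of a end before those of b begin, and are therefore preserved):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> a = dd.from_pandas(pd.DataFrame({'x': [1, 2, 3]}, index=[1, 3, 5]),
...                    npartitions=2)
>>> b = dd.from_pandas(pd.DataFrame({'x': [4, 5, 6]}, index=[6, 8, 10]),
...                    npartitions=2)
>>> dd.concat([a, b]).known_divisions
True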
-
dask.dataframe.multi.
merge
(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)¶ Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
Parameters: left : DataFrame
right : DataFrame
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
- left: use only keys from left frame, similar to a SQL left outer join; preserve key order
- right: use only keys from right frame, similar to a SQL right outer join; preserve key order
- outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
- inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys
on : label or list
Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.
right_on : label or list, or array-like
Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.
left_index : boolean, default False
Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels
right_index : boolean, default False
Use the index from the right DataFrame as the join key. Same caveats as left_index
sort : boolean, default False
Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword)
suffixes : 2-length sequence (tuple, list, …)
Suffix to apply to overlapping column names in the left and right side, respectively
copy : boolean, default True
If False, do not copy data unnecessarily
indicator : boolean or string, default False
If True, adds a column to output DataFrame called “_merge” with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.
validate : string, default None
If specified, checks if merge is of specified type.
- “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
- “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
- “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
- “many_to_many” or “m:m”: allowed, but does not result in checks.
New in version 0.21.0.
Returns: merged : DataFrame
The output type will be the same as ‘left’, if it is a subclass of DataFrame.
See also
merge_ordered, merge_asof, DataFrame.join
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0
Examples
>>> A              >>> B
    lkey value         rkey value
0   foo  1         0   foo  5
1   bar  2         1   bar  6
2   baz  3         2   qux  7
3   foo  4         3   bar  8
>>> A.merge(B, left_on='lkey', right_on='rkey', how='outer')
   lkey  value_x rkey  value_y
0  foo   1       foo   5
1  foo   4       foo   5
2  bar   2       bar   6
3  bar   2       bar   8
4  baz   3       NaN   NaN
5  NaN   NaN     qux   7
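The same operation with dask DataFrames, as a short hedged sketch (the frames are built from small pandas objects purely for illustration; row order of the result is not shown because it is not guaranteed across partitions):
>>> import pandas as pd
>>> import dask.dataframe as dd
>>> A = dd.from_pandas(pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                                  'value': [1, 2, 3, 4]}), npartitions=2)
>>> B = dd.from_pandas(pd.DataFrame({'rkey': ['foo', 'bar', 'qux', 'bar'],
...                                  'value': [5, 6, 7, 8]}), npartitions=2)
>>> dd.merge(A, B, left_on='lkey', right_on='rkey', how='outer').compute()  # doctest: +SKIP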