Contingency tables

Statsmodels supports a variety of approaches for analyzing contingency tables, including methods for assessing independence, symmetry, homogeneity, and methods for working with collections of tables from a stratified population.

The methods described here are mainly for two-way tables. Multi-way tables can be analyzed using log-linear models. Statsmodels does not currently have a dedicated API for log-linear modeling, but Poisson regression in statsmodels.genmod.GLM can be used for this purpose.

A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with \(r\) levels and one with \(c\) levels, then we have an \(r \times c\) contingency table. The table can be described in terms of the number of observations that fall into a given cell of the table, e.g. \(T_{ij}\) is the number of observations that have level \(i\) for the first variable and level \(j\) for the second variable. Note that each variable must have a finite number of levels (or categories), which can be either ordered or unordered. In different contexts, the variables defining the axes of a contingency table may be called categorical variables or factor variables. They may be either nominal (if their levels are unordered) or ordinal (if their levels are ordered).

The underlying population for a contingency table is described by a distribution table \(P_{i, j}\). The elements of \(P\) are probabilities, and the sum of all elements in \(P\) is 1. Methods for analyzing contingency tables use the data in \(T\) to learn about properties of \(P\).
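For instance (a minimal illustration with made-up counts), the natural estimate of the cell probabilities in \(P\) is obtained by normalizing the observed counts \(T\) by the sample size:

```python
import numpy as np

# Hypothetical 2x2 table of observed counts T.
T = np.array([[20, 10],
              [5, 15]])

# Estimate the population cell probabilities P by T / n.
P_hat = T / T.sum()

print(P_hat.sum())  # the estimated probabilities sum to 1
```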

statsmodels.stats.Table is the most basic class for working with contingency tables. We can create a Table object directly from any rectangular array-like object containing the contingency table cell counts:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import statsmodels.api as sm

In [4]: df = sm.datasets.get_rdataset("Arthritis", "vcd", cache=True).data

In [5]: tab = pd.crosstab(df['Treatment'], df['Improved'])

In [6]: tab = tab.loc[:, ["None", "Some", "Marked"]]

In [7]: table = sm.stats.Table(tab)

Alternatively, we can pass the raw data and let the Table class construct the array of cell counts for us:

In [8]: table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])

Independence

Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the joint distribution is independent, it can be written as the outer product of the row and column marginal distributions:

\[P_{ij} = \sum_k P_{ik} \cdot \sum_k P_{kj} \quad \forall i, j\]
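As a quick numerical illustration (with a made-up distribution), an independent joint distribution is exactly the outer product of its row and column marginals:

```python
import numpy as np

row = np.array([0.6, 0.4])        # row marginal distribution
col = np.array([0.2, 0.3, 0.5])   # column marginal distribution

# Independent joint distribution: P[i, j] = row[i] * col[j].
P = np.outer(row, col)

# The marginals of P recover the inputs, and P sums to 1.
assert np.allclose(P.sum(axis=1), row)
assert np.allclose(P.sum(axis=0), col)
assert abs(P.sum() - 1.0) < 1e-12
```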

We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:

In [9]: print(table.table_orig)
Improved   Marked  None  Some
Treatment                    
Placebo         7    29     7
Treated        21    13     7

In [10]: print(table.fittedvalues)
Improved      Marked  None      Some
Treatment                           
Placebo    14.333333  21.5  7.166667
Treated    13.666667  20.5  6.833333

In [11]: print(table.resid_pearson)
Improved     Marked      None      Some
Treatment                              
Placebo   -1.936992  1.617492 -0.062257
Treated    1.983673 -1.656473  0.063758

In this example, compared to a sample from a population in which the rows and columns are independent, we have too many observations in the placebo/no improvement and treated/marked improvement cells, and too few observations in the placebo/marked improvement and treated/no improvement cells. This reflects the apparent benefits of the treatment.

If the rows and columns of a table are unordered (i.e. are nominal factors), then the most common approach for formally assessing independence is using Pearson’s \(\chi^2\) statistic. It’s often useful to look at the cell-wise contributions to the \(\chi^2\) statistic to see where the evidence for dependence is coming from.

In [12]: rslt = table.test_nominal_association()

In [13]: print(rslt.pvalue)
0.0014626434089526352

In [14]: print(table.chi2_contribs)
Improved     Marked      None      Some
Treatment                              
Placebo    3.751938  2.616279  0.003876
Treated    3.934959  2.743902  0.004065

For tables with ordered row and column factors, we can use the linear-by-linear association test to obtain more power against alternative hypotheses that respect the ordering. The test statistic for the linear-by-linear association test is

\[\sum_{i, j} r_i c_j T_{ij}\]

where \(r_i\) and \(c_j\) are row and column scores. Often these scores are set to the sequence 0, 1, …. This gives the ‘Cochran-Armitage trend test’.
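To make the statistic concrete, here is a sketch of computing it by hand with the default 0, 1, … scores on a hypothetical table:

```python
import numpy as np

# Hypothetical 2x3 table of counts with ordered rows and columns.
T = np.array([[10, 20, 30],
              [30, 20, 10]])

r = np.arange(T.shape[0])  # row scores 0, 1, ...
c = np.arange(T.shape[1])  # column scores 0, 1, ...

# sum over i, j of r_i * c_j * T_{ij}
stat = (np.outer(r, c) * T).sum()
print(stat)
```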

In [15]: rslt = table.test_ordinal_association()

In [16]: print(rslt.pvalue)
0.023644578093923983

We can assess the association in an \(r \times c\) table by constructing a series of \(2\times 2\) tables and calculating their odds ratios. There are two ways to do this. The local odds ratios construct \(2\times 2\) tables from adjacent row and column categories.

In [17]: print(table.local_oddsratios)
Improved     Marked      None  Some
Treatment                          
Placebo    0.149425  2.230769   NaN
Treated         NaN       NaN   NaN

In [18]: taloc = sm.stats.Table2x2(np.asarray([[7, 29], [21, 13]]))

In [19]: print(taloc.oddsratio)
0.14942528735632185

In [20]: taloc = sm.stats.Table2x2(np.asarray([[29, 7], [13, 7]]))

In [21]: print(taloc.oddsratio)
2.230769230769231

The cumulative odds ratios construct \(2\times 2\) tables by dichotomizing the row and column factors at each possible point.

In [22]: print(table.cumulative_oddsratios)
Improved     Marked      None  Some
Treatment                          
Placebo    0.185185  1.058824   NaN
Treated         NaN       NaN   NaN

In [23]: tab1 = np.asarray([[7, 29 + 7], [21, 13 + 7]])

In [24]: tacum = sm.stats.Table2x2(tab1)

In [25]: print(tacum.oddsratio)
0.18518518518518517

In [26]: tab1 = np.asarray([[7 + 29, 7], [21 + 13, 7]])

In [27]: tacum = sm.stats.Table2x2(tab1)

In [28]: print(tacum.oddsratio)
1.0588235294117647

A mosaic plot is a graphical approach to informally assessing dependence in two-way tables.

from statsmodels.graphics.mosaicplot import mosaic
fig, _ = mosaic(df, index=["Treatment", "Improved"])

Symmetry and homogeneity

Symmetry is the property that \(P_{i, j} = P_{j, i}\) for every \(i\) and \(j\). Homogeneity is the property that the marginal distribution of the row factor and the column factor are identical, meaning that

\[\sum_j P_{ij} = \sum_j P_{ji} \quad \forall i\]

Note that for these properties to be applicable the table \(P\) (and \(T\)) must be square, and the row and column categories must be identical and must occur in the same order.
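For a quick sense of these properties (using made-up counts), note that a symmetric table necessarily has identical row and column margins, while the converse need not hold:

```python
import numpy as np

# Hypothetical square table of counts.
T = np.array([[10, 4, 2],
              [4, 20, 6],
              [2, 6, 30]])

# Symmetry: T[i, j] == T[j, i] for all i, j.
assert np.allclose(T, T.T)

# Symmetry implies homogeneity: row margins equal column margins.
assert np.allclose(T.sum(axis=1), T.sum(axis=0))
```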

To illustrate, we load a data set, create a contingency table, and calculate the row and column margins. The Table class contains methods for analyzing \(r \times c\) contingency tables. The data set loaded below contains assessments of visual acuity in people’s left and right eyes. We first load the data and create a contingency table.

df = sm.datasets.get_rdataset("VisualAcuity", "vcd", cache=True).data
df = df.loc[df.gender == "female", :]
tab = df.set_index(['left', 'right'])
del tab["gender"]
tab = tab.unstack()
tab.columns = tab.columns.get_level_values(1)
print(tab)

Next we create a SquareTable object from the contingency table.

sqtab = sm.stats.SquareTable(tab)
row, col = sqtab.marginal_probabilities
print(row)
print(col)

The summary method prints results for the symmetry and homogeneity testing procedures.

print(sqtab.summary())

If we had the individual case records in a dataframe called data, we could also perform the same analysis by passing the raw data using the SquareTable.from_data class method.

sqtab = sm.stats.SquareTable.from_data(data[['left', 'right']])
print(sqtab.summary())

A single 2x2 table

Several methods for working with individual 2x2 tables are provided in the sm.stats.Table2x2 class. The summary method displays several measures of association between the rows and columns of the table.

In [41]: table = np.asarray([[35, 21], [25, 58]])

In [42]: t22 = sm.stats.Table2x2(table)

In [43]: print(t22.summary())
               Estimate   SE   LCB   UCB  p-value
-------------------------------------------------
Odds ratio        3.867       1.890 7.912   0.000
Log odds ratio    1.352 0.365 0.636 2.068   0.000
Risk ratio        2.075       1.412 3.050   0.000
Log risk ratio    0.730 0.197 0.345 1.115   0.000
-------------------------------------------------

Note that the risk ratio is not symmetric so different results will be obtained if the transposed table is analyzed.

In [44]: table = np.asarray([[35, 21], [25, 58]])

In [45]: t22 = sm.stats.Table2x2(table.T)

In [46]: print(t22.summary())
               Estimate   SE   LCB   UCB  p-value
-------------------------------------------------
Odds ratio        3.867       1.890 7.912   0.000
Log odds ratio    1.352 0.365 0.636 2.068   0.000
Risk ratio        2.194       1.436 3.353   0.000
Log risk ratio    0.786 0.216 0.362 1.210   0.000
-------------------------------------------------

Stratified 2x2 tables

Stratification occurs when we have a collection of contingency tables defined by the same row and column factors. In the example below, we have a collection of 2x2 tables reflecting the joint distribution of smoking and lung cancer in each of several regions of China. It is possible that the tables all have a common odds ratio, even while the marginal probabilities vary among the strata. The ‘Breslow-Day’ procedure tests whether the data are consistent with a common odds ratio. It appears below as the Test of constant OR. The Mantel-Haenszel procedure tests whether this common odds ratio is equal to one. It appears below as the Test of OR=1. It is also possible to estimate the common odds and risk ratios and obtain confidence intervals for them. The summary method displays all of these results. Individual results can be obtained from the class methods and attributes.

In [47]: data = sm.datasets.china_smoking.load()

In [48]: mat = np.asarray(data.data)

In [49]: tables = [np.reshape(list(x)[1:], (2, 2)) for x in mat]

In [50]: st = sm.stats.StratifiedTable(tables)

In [51]: print(st.summary())
                   Estimate   LCB    UCB 
-----------------------------------------
Pooled odds           2.174   1.984 2.383
Pooled log odds       0.777   0.685 0.868
Pooled risk ratio     1.519              
                                         
                 Statistic P-value 
-----------------------------------
Test of OR=1       280.138   0.000 
Test constant OR     5.200   0.636 
                       
-----------------------
Number of tables    8  
Min n             213  
Max n            2900  
Avg n            1052  
Total n          8419  
-----------------------

Module Reference

Table(table[, shift_zeros]) Analyses that can be performed on a two-way contingency table.
Table2x2(table[, shift_zeros]) Analyses that can be performed on a 2x2 contingency table.
SquareTable(table[, shift_zeros]) Methods for analyzing a square contingency table.
StratifiedTable(tables[, shift_zeros]) Analyses for a collection of 2x2 contingency tables.
mcnemar(table[, exact, correction]) McNemar test of homogeneity.
cochrans_q(x[, return_object]) Cochran’s Q test for identical binomial proportions.

See also

Scipy has several functions for analyzing contingency tables, including Fisher’s exact test, which is not currently available in Statsmodels.