delta package

Submodules

delta.cluster module

Clustering of distance matrices.

Clustering represents a hierarchical clustering, which can be flattened using Clustering.fclustering(); the flattened clustering is then represented by FlatClustering.

If supported by the installed version of scikit-learn, there is also a KMedoidsClustering.

class delta.cluster.Clustering(distance_matrix, method='ward', **kwargs)[source]

Bases: object

Represents a hierarchical clustering.

Note

This is subject to refactoring once we implement more clustering methods.

fclustering()[source]

Returns a default flat clustering from the hierarchical version.

This method uses the DocumentDescriber to determine the groups, and uses the number of groups as a maxclust criterion.

Returns

A properly initialized representation of the flat clustering.

Return type

FlatClustering
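A minimal usage sketch of the workflow described above; distances stands for a DistanceMatrix previously computed by a delta function, so this is an illustration rather than a self-contained doctest:

>>> from delta.cluster import Clustering
>>> clustering = Clustering(distances, method='ward')  # hierarchical clustering
>>> flat = clustering.fclustering()                    # flatten to a FlatClustering
>>> scores = flat.evaluate()                           # pandas.Series with all scores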

class delta.cluster.FlatClustering(distances, clusters=None, metadata=None, **kwargs)[source]

Bases: object

A flat clustering represents a non-hierarchical clustering.

Notes

FlatClustering uses a data frame field called data to store the actual clustering. This field will have the same index as the distance matrix, and three columns labeled Group, GroupID, and Cluster. Group will be the group label returned by the DocumentDescriber we use, GroupID a numerical ID for each group (to be used as ground truth) and Cluster the numerical ID of the actual cluster associated by the clustering algorithm.

As long as FlatClustering's initialized property is False, the clustering has not been assigned yet.

set_clusters(clusters)[source]
static ngroups(df)[source]

With df being a data frame that has a Group column, return the number of different authors in df.

cluster_errors()[source]

Calculates the number of cluster errors by:

  1. calculating the total number of different authors in the set

  2. calling sch.fcluster to generate at most that many flat clusters

  3. for each of those clusters, the cluster errors are the number of authors in this cluster - 1

  4. summing up each cluster’s errors to obtain the result

purity()[source]

To compute purity, each cluster is assigned to the class which is most frequent in the cluster; the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by \(N\), the total number of documents.
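The following sketch illustrates this calculation on the data frame described in the FlatClustering notes above (columns Group and Cluster); it shows the formula, not the library’s internal implementation:

>>> def purity_sketch(data):
...     # data: FlatClustering.data with 'Group' and 'Cluster' columns.
...     # For each cluster, count the documents of its most frequent group,
...     # then divide the total by the number of documents N.
...     correct = sum(members['Group'].value_counts().iloc[0]
...                   for _, members in data.groupby('Cluster'))
...     return correct / len(data)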

entropy()[source]

Smaller entropy values suggest a better clustering.

adjusted_rand_index()[source]

Calculates the Adjusted Rand Index for the given flat clustering. See sklearn.metrics.adjusted_rand_score: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score

homogeneity_completeness_v_measure()[source]
evaluate()[source]
Returns

All scores for the current clustering

Return type

pandas.Series

clusters(labeled=False)[source]

Documents by cluster.

Parameters

labeled (bool) – If True, represent each document by its label as calculated by the DocumentDescriber. This is typically a human-readable, shortened description

Returns

Maps each cluster number to a list of documents.

Return type

dict

describe()[source]

Returns a description of the current flat clustering.

class delta.cluster.KMedoidsClustering_distances(distances, n_clusters=None, metadata=None, **kwargs)[source]

Bases: delta.cluster.FlatClustering

class delta.cluster.KMedoidsClustering(corpus, delta, n_clusters=None, extra_args={}, metadata=None, **kwargs)[source]

Bases: delta.cluster.FlatClustering

delta.corpus module

The delta.corpus module contains code for building, loading, saving, and manipulating the representation of a corpus. Its heart is the Corpus class which represents the feature matrix. Also contained are default implementations for reading and tokenizing files and creating a feature vector out of that.

class delta.corpus.FeatureGenerator(lower_case: bool = False, encoding: str = 'utf-8', glob: str = '*.txt', skip: Optional[str] = None, token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0), max_tokens: Optional[int] = None, ngrams: Optional[int] = None, parallel: Union[int, bool, joblib.parallel.Parallel] = False, sort: str = 'documents', sparse: bool = False)[source]

Bases: object

A feature generator is responsible for converting a subdirectory of files into a feature matrix (that will then become a corpus). If you need to customize the feature extraction process, create a custom feature generator and pass it into your Corpus constructor call along with its subdir argument.

The default feature generator is able to process a directory of text files, tokenize each of the text files according to a regular expression, and count each token type for each file. To customize feature extraction, you have two options:

  1. for simple customizations, just create a new FeatureGenerator and set the constructor arguments accordingly. Look in the docstring for __init__() for details.

  2. in more complex cases, create a subclass and override methods as you see fit (see the sketch after this overview).

On a feature generator passed in to Corpus, only two methods will be called:

  • __call__(), i.e. the object as a callable, to actually generate

    the feature vector,

  • metadata to obtain metadata fields that will be included in

    the corresponding corpus.

So, if you wish to write a completely new feature generator, you can ignore the other methods.
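As a sketch of option 2, the following subclass overrides postprocess_tokens() to drop tokens shorter than three characters; the filtering rule and the class name are invented purely for illustration:

>>> from delta.corpus import FeatureGenerator
>>> class LongTokenFeatureGenerator(FeatureGenerator):
...     def postprocess_tokens(self, tokens):
...         # run the standard postprocessing (lower_case, ngrams) first,
...         # then keep only tokens with at least three characters
...         for token in super().postprocess_tokens(tokens):
...             if len(token) >= 3:
...                 yield token

An instance of such a subclass can then be passed to Corpus via the feature_generator keyword argument.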

Parameters
  • lower_case (bool) – if True, normalize all tokens to lower case before counting them

  • encoding (str) – the encoding to use when reading files

  • glob (str) – the pattern inside the subdirectory to find files.

  • skip (str) – don’t handle files that match this pattern

  • token_pattern (re.Regex) – The regular expression used to identify tokens. The default, LETTERS_PATTERN, will simply find sequences of unicode letters. WORD_PATTERN will find the shortest sequence of letters and apostrophes between two word boundaries (according to the simple word-boundary algorithm from Unicode regular expressions) that contains at least one letter.

  • max_tokens (int) – If set, stop reading each file after that many words.

  • ngrams (int) – Count token ngrams instead of single tokens

  • parallel (bool, int, Parallel) – If truthy, read and parse files in parallel. The actual argument may be None or False for no special processing, an int for the required number of jobs, or a dictionary with Parallel arguments for finer control.

  • sort (str) – Sort the final feature matrix by index before returning. Possible values: documents or index (sort by document names), features or columns (sort by feature labels, i.e. words), both (sort along both axes), None or the empty string (do not sort).

  • sparse (bool) – build a sparse dataframe. Requires Pandas >=1.0

lower_case: bool = False
encoding: str = 'utf-8'
glob: str = '*.txt'
skip: Optional[str] = None
token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0)
max_tokens: Optional[int] = None
ngrams: Optional[int] = None
parallel: Union[int, bool, joblib.parallel.Parallel] = False
sort: str = 'documents'
sparse: bool = False
logger = <Logger delta.corpus.FeatureGenerator (WARNING)>
tokenize(lines)[source]

Tokenizes the given lines.

This method is called by count_tokens(). The default implementation will return an iterable of all tokens in the given lines that match the token_pattern. The result of this method can further be postprocessed by postprocess_tokens().

Parameters

lines – Iterable of strings in which to look for tokens.

Returns

Iterable (default implementation generator) of tokens

postprocess_tokens(tokens)[source]

Postprocesses the tokens according to the options provided when creating the feature generator.

Currently respects lower_case and ngrams. This is called by count_tokens after tokenizing.

Parameters

tokens – iterable of tokens as returned by tokenize()

Returns

iterable of postprocessed tokens

count_tokens(lines)[source]

This calls tokenize() to split the iterable lines into tokens. If the lower_case attribute is set, the tokens are then converted to lower case. The tokens are counted, and the method returns a pandas.Series mapping each token to its number of occurrences.

This is called by process_file().

Parameters

lines – Iterable of strings in which to look for tokens.

Returns

maps tokens to the number of occurrences.

Return type

pandas.Series

get_name(filename)[source]

Converts a single file name to a label for the corresponding feature vector.

Returns

Feature vector label (filename w/o extension by default)

Return type

str

process_file(filename)[source]

Processes a single file to a feature vector.

The default implementation reads the file pointed to by filename as a text file, calls count_tokens() to create token counts and get_name() to calculate the label for the feature vector.

Parameters

filename (str) – The path to the file to process

Returns

Feature counts, its name set according to get_name()

Return type

pd.Series

process_directory(directory)[source]

Iterates through the given directory and runs process_file() for each file matching glob in there.

Parameters

directory (str) – Path to the directory to process

Returns

mapping name to pandas.Series

Return type

dict

property metadata

Returns: Metadata: metadata record that describes the parameters of the features used for corpora created using this feature generator.

class delta.corpus.SimpleFeatureGenerator(lower_case: bool = False, encoding: str = 'utf-8', glob: str = '*.txt', skip: Optional[str] = None, token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0), max_tokens: Optional[int] = None, ngrams: Optional[int] = None, parallel: Union[int, bool, joblib.parallel.Parallel] = False, sort: str = 'documents', sparse: bool = False)[source]

Bases: delta.corpus.FeatureGenerator

A simplified, faster version of the FeatureGenerator.

With respect to feature generation the behaviour is the same as with FeatureGenerator, but it is slightly less flexible with respect to subclassing. It does not read the files linewise, and it never creates pd.Series().

preprocess_text(text)[source]
postprocess_tokens(tokens)[source]

Postprocesses the tokens according to the options provided when creating the feature generator.

Currently respects lower_case and ngrams. This is called by count_tokens after tokenizing.

Parameters

tokens – iterable of tokens as returned by tokenize()

Returns

iterable of postprocessed tokens

process_file(filename)[source]

Processes a single file to a feature vector.

The default implementation reads the file pointed to by filename as a text file, calls count_tokens() to create token counts and get_name() to calculate the label for the feature vector.

Parameters

filename (str) – The path to the file to process

Returns

Feature counts, its name set according to get_name()

Return type

pd.Series

exception delta.corpus.CorpusNotComplete(msg='Corpus not complete anymore')[source]

Bases: ValueError

exception delta.corpus.CorpusNotAbsolute(operation)[source]

Bases: delta.corpus.CorpusNotComplete

class delta.corpus.Corpus(source=None, *, subdir=None, file=None, corpus=None, feature_generator=None, document_describer=<delta.util.DefaultDocumentDescriber object>, metadata=None, **kwargs)[source]

Bases: pandas.core.frame.DataFrame

Creates a new Corpus.

You can create a corpus either from a filesystem subdir with raw text files, or from a CSV file with a document-term matrix, or from another corpus or dataframe that contains (potentially preprocessed) document/term vectors. Either option may be passed via appropriately named keyword argument or as the only positional argument, but exactly one must be present.

If you pass a subdirectory, Corpus will call a FeatureGenerator to read and parse the files and to generate a default word count. The default implementation will search for plain text files *.txt inside the directory and parse them using a simple regular expression. It has a few options, e.g., glob and lower_case, that can also be passed directly to Corpus as keyword arguments. E.g., Corpus('texts', glob='plain-*.txt', lower_case=True) will look for files called plain-xxx.txt and convert them to lower case before tokenizing. See FeatureGenerator for more details.

The document_describer can contain per-document metadata which can be used, e.g., as ground truth.

The metadata record contains global metadata (e.g., which transformations have already been performed); it will be inherited from a corpus argument, and all additional keyword arguments will be included in this record.

Parameters

source – Positional variant of either subdir, file, or corpus

Keyword Arguments
  • subdir (str) – Path to a subdirectory containing the (unprocessed) corpus data.

  • file (str) – Path to a CSV file containing the feature vectors.

  • corpus (pandas.DataFrame) – A dataframe or Corpus from which to create a new corpus, as a copy.

  • feature_generator (FeatureGenerator) – A customizable helper class that will process a subdir to a feature matrix, if the subdir argument is also given. If None, a default feature generator will be used.

  • metadata (dict) – A dictionary with metadata to copy into the new corpus.

  • **kwargs – Additionally, if feature_generator is None and subdir is not None, you can pass FeatureGenerator arguments and they will be used when instantiating the feature generator. Additional keyword arguments will be set in the metadata record of the new corpus.

Warning

You should either use a single positional argument (source) or one of subdir, file, or corpus as keyword arguments. In future versions, source will be positional-only.
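An illustrative sketch of the constructor variants described above; 'texts' and 'corpus_words.csv' are placeholder paths:

>>> import delta
>>> raw = delta.Corpus(subdir='texts', glob='*.txt', lower_case=True)  # from raw text files
>>> mfw = raw.get_mfw_table(2000)                  # 2000 most frequent words, relative frequencies
>>> saved = delta.Corpus(file='corpus_words.csv')  # from a previously saved document-term matrix
>>> copy = delta.Corpus(corpus=mfw)                # from an existing corpus or dataframe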

new_data(data, **metadata)[source]

Wraps the given DataFrame with metadata from this corpus object.

Parameters
  • data (pandas.DataFrame) – Raw data that is derived by, e.g., pandas filter operations

  • **metadata – Metadata fields that should be changed / modified

save(filename='corpus_words.csv')[source]

Saves the corpus to a CSV file.

The corpus will be saved to a CSV file containing documents in the columns and features in the rows, i.e. a transposed representation. Document and feature labels will be saved to the first row or column, respectively.

A metadata file will be saved alongside the file.

Parameters

filename (str) – The target file.

is_absolute() → bool[source]
Returns

True if this is a corpus using absolute frequencies

Return type

bool

is_complete() → bool[source]

A corpus is complete as long as it contains the absolute frequencies of all features of all documents. Many operations like calculating the relative frequencies require a complete corpus. Once a corpus has lost its completeness, it is not possible to restore it.

get_mfw_table(mfwords)[source]

Shortens the list to the given number of most frequent words and converts the word counts to relative frequencies.

This returns a new Corpus, the data in this object is not modified.

Parameters

mfwords (int) – number of most frequent words in the new corpus. 0 means all words.

Returns

a new sorted corpus shortened to mfwords

Return type

Corpus

top_n(mfwords)[source]

Returns a new Corpus that contains the top n features.

Parameters

mfwords (int) – Number of most frequent items in the new corpus.

Returns

a new corpus shortened to mfwords

Return type

Corpus

save_wordlist(filename, **kwargs)[source]

Saves the current word list to a text file.

Parameters
  • filename (str) – Path to the file to save

  • kwargs – Additional arguments to pass to open()

filter_wordlist(filename, **kwargs)[source]

Returns a new corpus that contains the features from the given file.

This method will read the list of words from the given file and then return a new corpus that uses the features listed in the file, in the order they are in the file.

Parameters

filename (str) – Path to the file to load. Each line contains one feature. Leading and trailing whitespace, lines starting with #, and empty lines are ignored.

Returns

New corpus with selected features.
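For illustration, saving one corpus’ word list and restricting another corpus to it might look like this (the file name and the corpus variables are placeholders):

>>> mfw.save_wordlist('wordlist.txt')                        # persist the current features
>>> filtered = other_corpus.filter_wordlist('wordlist.txt')  # reuse them for another corpus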

filter_features(features, **metadata)[source]

Returns a new corpus that contains only the given features.

Parameters

features (Iterable) – The features to select. If they are stored in a file, use filter_wordlist() instead.

relative_frequencies()[source]
z_scores()[source]
cull(ratio=None, threshold=None, keepna=False)[source]

Removes all features that do not appear in a minimum number of documents.

Parameters
  • ratio (float) – Minimum ratio of documents a word must occur in to be retained. Note that we’re always rounding towards the ceiling, i.e. if the corpus contains 10 documents and ratio=1/3, a word must occur in at least 4 documents (if this is >= 1, it is interpreted as threshold)

  • threshold (int) – Minimum number of documents a word must occur in to be retained

  • keepna (bool) – If set to True, the missing words in the returned corpus will be retained as nan instead of 0.

Returns

A new corpus with the culled words removed. The original corpus is left unchanged.

Return type

Corpus
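The two ways to state the culling criterion, sketched on an arbitrary Corpus instance named corpus:

>>> culled = corpus.cull(ratio=0.5)                     # keep words occurring in at least half of the documents
>>> culled_nan = corpus.cull(threshold=5, keepna=True)  # at least 5 documents, keep missing words as NaN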

reparse(feature_generator, subdir=None, **kwargs)[source]

Parse or re-parse a set of documents with different settings.

This runs the given feature generator on the given or configured subdirectory. The feature vectors returned by the feature generator will replace or augment the corpus.

Parameters
  • feature_generator (FeatureGenerator) – Will be used for extracting the features.

  • subdir (str) – If given, will be passed to the feature generator for processing. Otherwise, we’ll use the subdir configured with this corpus.

  • **kwargs – Additional metadata for the returned corpus.

Returns

a new corpus with the respective columns replaced or added.

The current object will be left unchanged.

Return type

Corpus

Raises

CorpusNotAbsolute – if called on a corpus with relative frequencies

tokens() → pandas.core.series.Series[source]

Number of tokens by text

types() → pandas.core.series.Series[source]

Number of different features by text

ttr() → float[source]

Type/token ratio for the whole corpus.

ttr_by_text() → pandas.core.series.Series[source]

Type/token ratio for each text.

delta.deltas module

This module contains the actual delta measures.

Normalizations

A normalization is a function that works on a Corpus and returns a somewhat normalized version of that corpus. Each normalization has the following additional attributes:

  • name – an identifier for the normalization, usually the function name

  • title – an optional, human-readable name for the normalization

Each normalization leaves its name in the ‘normalizations’ field of the corpus’ Metadata.

All available normalizations need to be registered to the normalization registry.

Delta Functions

A delta function takes a Corpus and creates a Distances table from that. Each delta function has the following properties:

  • descriptor – a systematic descriptor of the distance function. For simple

    delta functions (see below), this is simply the name. For composite distance functions, this starts with the name of a simple delta function and is followed by a list of normalizations (in order) that are applied to the corpus before applying the distance function.

  • name – a unique name for the distance function

  • title – an optional, human-readable name for the distance function.

Simple Delta Functions

Simple delta functions are functions that compute the distance directly from a pair of feature vectors; see DeltaFunction.distance() and the delta function options below.

class delta.deltas.Normalization(f, name=None, title=None, register=True)[source]

Bases: object

Wrapper for normalizations.

delta.deltas.normalization(*args, **kwargs)[source]

Decorator that creates a Normalization from a function or (callable) object. Can be used without or with keyword arguments:

  • name (str) – Name (identifier) for the normalization. By default, the function’s name is used.

  • title (str) – Human-readable title for the normalization.
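A minimal sketch of the decorator in use; rel_freq_sketch is a made-up name, and the body simply delegates to the documented Corpus.relative_frequencies() method:

>>> from delta.deltas import normalization
>>> @normalization(title='Relative frequencies (sketch)')
... def rel_freq_sketch(corpus):
...     # a normalization takes a Corpus and returns a normalized version of it
...     return corpus.relative_frequencies()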

class delta.deltas.DeltaFunction(f=None, descriptor=None, name=None, title=None, register=True)[source]

Bases: object

Abstract base class of a delta function.

To define a delta function, you have various options:

  1. subclass DeltaFunction and override its __call__() method with something that directly handles a Corpus.

  2. subclass DeltaFunction and override its distance() method with a distance function

  3. instantiate DeltaFunction and pass it a distance function, or use the delta() decorator (see the sketch after the parameter list below)

  4. use one of the subclasses

Creates a custom delta function.

Parameters
  • f (function) – a distance function that calculates the difference between two feature vectors and returns a float. If passed, this will be used for the implementation.

  • name (str) – The name/id of this function. Can be inferred from f or descriptor.

  • descriptor (str) – The descriptor to identify this function.

  • title (str) – A human-readable title for this function.

  • register (bool) – If true (default), register this delta function with the function registry on instantiation.
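A sketch of option 3 from the list above; the distance function and its name are invented for illustration, and register=False keeps the example out of the global registry:

>>> import numpy as np
>>> from delta.deltas import DeltaFunction
>>> def manhattan_sketch(u, v):
...     # u and v are the feature vectors (pandas.Series) of two documents
...     return float(np.abs(u - v).sum())
>>> my_delta = DeltaFunction(manhattan_sketch, name='manhattan_sketch',
...                          title='Manhattan distance (sketch)', register=False)
>>> distances = my_delta(corpus)   # corpus: a delta.Corpus instance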

static distance(u, v, *args, **kwargs)[source]

Calculate a distance between two feature vectors.

This is an abstract method, you must either inherit from DeltaFunction and override distance or assign a function in order to use this.

Parameters
  • u (pandas.Series) – The documents to compare.

  • v (pandas.Series) – The documents to compare.

  • *args – Passed through from the caller

  • **kwargs

    Passed through from the caller

Returns

Distance between the documents.

Return type

float

Raises

NotImplementedError if no implementation is provided.

register()[source]

Registers this delta function with the global function registry.

iterate_distance(corpus, *args, **kwargs)[source]

Calculates the distance matrix for the given corpus.

The default implementation will iterate over all pairwise combinations of the documents in the given corpus and call distance() on each pair, passing on the additional arguments.

Clients may want to use __call__() instead, i.e. they want to call this object as a function.

Parameters
  • corpus (Corpus) – feature matrix for which to calculate the distance

  • *args – further arguments for the matrix

  • **kwargs

    further arguments for the matrix

Returns

square dataframe containing pairwise distances.

The default implementation will return a matrix that has zeros on the diagonal and the lower triangle a mirror of the upper triangle.

Return type

pandas.DataFrame

create_result(df, corpus)[source]

Wraps a square dataframe to a DistanceMatrix, adding appropriate metadata from corpus and this delta function.

Parameters
  • df – square dataframe with the pairwise distances

  • corpus (Corpus) – the corpus from which the distances have been calculated

Returns

df as values, with appropriate metadata

Return type

DistanceMatrix

prepare(corpus)[source]

Return the corpus prepared for the metric, if applicable.

Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.

If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.

The default implementation simply returns the corpus as-is.

Raises

NotImplementedError if there is no metric

class delta.deltas.PreprocessingDeltaFunction(distance_function, prep_function, descriptor=None, name=None, title=None, register=True)[source]

Bases: delta.deltas.DeltaFunction

Creates a custom delta function.

Parameters
  • f (function) – a distance function that calculates the difference between two feature vectors and returns a float. If passed, this will be used for the implementation.

  • name (str) – The name/id of this function. Can be inferred from f or descriptor.

  • descriptor (str) – The descriptor to identify this function.

  • title (str) – A human-readable title for this function.

  • register (bool) – If true (default), register this delta function with the function registry on instantiation.

static prep_function(corpus)[source]
class delta.deltas.CompositeDeltaFunction(descriptor, name=None, title=None, register=True)[source]

Bases: delta.deltas.DeltaFunction

A composite delta function consists of a basis (which is another delta function) and a list of normalizations. It first transforms the corpus via all the given normalizations in order, and then runs the basis on the result.

Creates a new composite delta function.

Parameters
  • descriptor (str) – Formally defines this delta function. First the name of an existing, registered distance function, then, separated by -, the names of normalizations to run, in order (see the example after this parameter list).

  • name (str) – Name by which this delta function is registered, in addition to the descriptor

  • title (str) – human-readable title

  • register (bool) – If true (the default), register this delta function on creation
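An illustration of the descriptor syntax; it assumes that a simple distance function named manhattan and a normalization named z_score are registered (the combination commonly used for Burrows’ Delta), so adjust the names to whatever is registered in your installation:

>>> from delta.deltas import CompositeDeltaFunction
>>> burrows_sketch = CompositeDeltaFunction('manhattan-z_score',
...                                         name='burrows_sketch',
...                                         title="Burrows' Delta (sketch)")
>>> distances = burrows_sketch(corpus)   # z_score first, then manhattan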

prepare(corpus)[source]

Return the corpus prepared for the metric, if applicable.

Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.

If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.

The default implementation simply returns the corpus as-is.

Raises

NotImplementedError if there is no metric

class delta.deltas.PDistDeltaFunction(metric, name=None, title=None, register=True, scale=False, **kwargs)[source]

Bases: delta.deltas.DeltaFunction

Wraps one of the metrics implemented by ssd.pdist() as a delta function.

Warning

You should use MetricDeltaFunction instead.

Parameters
  • metric (str) – The metric that should be called via ssd.pdist

  • name (str) – Name / Descriptor for the delta function, if None, metric is used

  • title (str) – Human-Readable Title

  • register (bool) – If false, don’t register this with the registry

  • **kwargs – passed on to ssd.pdist()

class delta.deltas.MetricDeltaFunction(metric, name=None, title=None, register=True, scale=False, fix_symmetry=True, **kwargs)[source]

Bases: delta.deltas.DeltaFunction

Distance functions based on scikit-learn’s sklearn.metrics.pairwise_distances().

Parameters
  • metric (str) – The metric that should be called via sklearn.metrics.pairwise_distances

  • name (str) – Name / Descriptor for the delta function, if None, metric is used

  • title (str) – Human-Readable Title

  • register (bool) – If false, don’t register this with the registry

  • scale (bool) – Scale by number of features

  • fix_symmetry – Force the resulting matrix to be symmetric

  • **kwargs – passed on to ssd.pdist()

Note

sklearn.metrics.pairwise_distances() is fast, but the result may not be exactly symmetric. The fix_symmetry option enforces symmetry by mirroring the lower-left triangle after calculating distances so that, e.g., scipy clustering won’t complain.
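A short usage sketch; 'cosine' is one of the metric names accepted by sklearn.metrics.pairwise_distances, and register=False keeps the example out of the registry:

>>> from delta.deltas import MetricDeltaFunction
>>> cosine_sketch = MetricDeltaFunction('cosine', name='cosine_sketch', register=False)
>>> distances = cosine_sketch(corpus)   # corpus: a delta.Corpus instance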

class delta.deltas.DistanceMatrix(df, copy_from=None, metadata=None, corpus=None, document_describer=None, **kwargs)[source]

Bases: pandas.core.frame.DataFrame

classmethod from_csv(filename)[source]

Loads a distance matrix from a cross-table style csv file.

save(filename)[source]
delta_values(transpose=False, check=True)[source]

Converts the given n×n Delta matrix to a \(\binom{n}{2}\) long series of distinct delta values – i.e. duplicates from the upper triangle and zeros from the diagonal are removed.

Parameters
  • transpose – if True, transpose the dataframe first, i.e. use the upper right triangle

  • check – if True and if the result does not contain any non-null value, try the other option for transpose.

delta_values_df()[source]

Returns a stacked form of the given delta table along with additional metadata. Assumes delta is symmetric.

The dataframe returned has the columns Author1, Author2, Text1, Text2, and Delta; it has an entry for every unique combination of texts.

f_ratio()[source]

Calculates the (normalized) F-ratio over the distance matrix, according to Heeringa et al.

Checks whether the distances within a group (i.e., texts by the same author) are much smaller than the distances between groups.

fisher_ld()[source]

Calculates Fisher’s Linear Discriminant for the distance matrix.

cf. Heeringa et al.

z_scores()[source]

Returns a distance matrix with the distances standardized using z-scores

partition()[source]

Splits this distance matrix into two sparse halves: the first contains only the differences between documents that are in the same group (‘in-group’), the second only the differences between documents that are in different groups.

Group associations are created according to the DocumentDescriber.

Returns

(in_group, out_group)

Return type

(DistanceMatrix, DistanceMatrix)

simple_score()[source]

Simple delta quality score for the given delta matrix:

The difference between the means of the standardized differences between works of different authors and works of the same author; i.e., works by different authors are considered score standard deviations more different than works by the same author.

evaluate()[source]
Returns

All scores implemented for distance matrices

Return type

pandas.Series

compare_with(doc_metadata, comparisons=None, join='inner')[source]

Compare the distance matrix value with values calculated from the given document metadata table.

Parameters
  • doc_metadata (pd.DataFrame) – a dataframe with one row per document and arbitrary columns

  • comparisons – see compare_pairwise

  • join (str) – inner (the default) or outer; if outer, keep pairs for which we have neither metadata nor comparisons.

Returns

a dataframe with a row for each pairwise document combination (as in DistanceMatrix.delta_values). The first column will contain the delta values, subsequent columns the metadata comparisons.

delta.experiments module

The experiments module can be used to perform a series of experiments in which you vary some of the arguments. Here’s the basic data model:

A _Facet_ is an aspect you wish to vary, e.g. the number of features. A facet delivers a set of _expressions_. Each expression represents the actual values of the facet, e.g., “3000 most frequent words” might be an expression of the facet ‘number of features’.

There are some different kinds of facets:

A _corpus builder facet_ determines how the actual corpus is built. The corpus builder facets are used to actually assemble a constructor call to the delta.Corpus class, i.e. for every combination of expressions we get a new Corpus. Thus, variation here may be quite lengthy.

A _corpus manipulation facet_ takes an existing corpus and manipulates it, e.g., by extracting the n most frequent words. This is much faster than building the corpus anew each time, so if you can, implement a corpus manipulation facet instead of a corpus builder one.

A _method facet_ delivers a delta function that should be manipulated.

delta.features module

Feature selection utilities.

delta.features.get_rfe_features(corpus, estimator=None, steps=[(10000, 1000), (1000, 200), (500, 25)], cv=True)[source]
Parameters
  • corpus – containing document_describer,

  • estimator – supervised learning estimator,

  • steps – list of tuples (features_to_select, step)

  • cv – additional cross-validated selection.

Returns

set of selected terms.

Return type

rfe_terms
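A hedged usage sketch; LinearSVC is only one possible estimator (any scikit-learn estimator exposing feature weights should work with recursive feature elimination), and filter_features() is the documented Corpus method for applying the selection:

>>> from sklearn.svm import LinearSVC
>>> from delta.features import get_rfe_features
>>> terms = get_rfe_features(corpus, estimator=LinearSVC(), cv=True)
>>> reduced = corpus.filter_features(terms)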

delta.graphics module

Various visualization tools.

class delta.graphics.Dendrogram(clustering, describer=None, ax=None, orientation='left', font_size=None, link_color='k', title='Corpus: {corpus}', xlabel='Delta: {delta_title}, {words} most frequent {features}')[source]

Bases: object

Creates a dendrogram representation from a hierarchical clustering.

This is a wrapper around, and an improvement to, sch.dendrogram(), tailored for use in pydelta.

Parameters
  • clustering (Clustering) – A hierarchical clustering.

  • describer (DocumentDescriber) – Document describer used for determining the groups and the labels for the documents used (optional). By default, the document describer inherited from the clustering is used.

  • ax (mpl.axes.Axes) – Axes object to draw on. Uses pyplot default axes if not provided.

  • orientation (str) – Orientation of the dendrogram. Currently, only “left” (the default) is supported.

  • font_size – Font size for the label, in points. If not provided, sch.dendrogram() calculates a default.

  • link_color (str) – The color used for the links in the dendrogram, by default k (for black).

  • title (str) – a title that will be printed on the plot. The string may be a template string as supported by str.format_map() with metadata field names in curly braces, it will be evaluated against the clustering’s metadata. If you pass None here, no title will be added.

Notes

The dendrogram will be painted by matplotlib / pyplot using the default styles, which means you can use, e.g., seaborn to influence the overall design of the image.

Dendrogram handles coloring differently than sch.dendrogram(): It will color the document labels according to the pre-assigned grouping (e.g., by author). To do so, it builds on matplotlib’s default color_cycle and rotates through it, so if you need more colors, adjust the color_cycle accordingly.
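A typical usage sketch, assuming clustering is a Clustering instance as produced in delta.cluster; the output file name is a placeholder:

>>> from delta.graphics import Dendrogram
>>> dendrogram = Dendrogram(clustering, font_size=8)
>>> dendrogram.show()                  # display via pyplot
>>> dendrogram.save('dendrogram.pdf')  # or write to a file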

show()[source]
save(fname, **kwargs)[source]
delta.graphics.scatterplot_delta(deltas, red_f=MDS(dissimilarity='precomputed', n_jobs=-1))[source]

Parameters
  • deltas – pydelta distance matrix

  • red_f – function for dimensionality reduction, e.g. decomposition.PCA(n_components=2)

Returns

the plot

delta.graphics.spikeplot(corpus, docs=slice(None, None, None), features=50, figsize=None, **kwargs)[source]

Prepares a spike plot of a (normalized) corpus.

Parameters
  • corpus (pandas.DataFrame) – The corpus to plot

  • docs (int, list or slice) – the documents to include in the plot, default: all documents

  • features (int, list, or slice) – the features to plot, default: top 50 features

  • figsize (2-element list) – size of the plot

  • kwargs – will be passed on to pd.DataFrame.plot()

Notes

The arguments docs and features can each be either:

  • None, selecting all items

  • something you would put into corpus.index[·] or corpus.columns[·], respectively, i.e. a label indexer

  • an integer, selecting the first n items

  • a list of integers, selecting exactly those items

Returns

the plot

delta.util module

Contains utility classes and functions.

exception delta.util.MetadataException[source]

Bases: Exception

class delta.util.Metadata(*args, **kwargs)[source]

Bases: collections.abc.Mapping

A metadata record contains information about how a particular object of the pyDelta universe has been constructed, or how it will be manipulated.

Metadata fields are simply attributes, and they can be used as such.

Create a new metadata instance. Arguments will be passed on to update().

Examples

>>> m = Metadata(lower_case=True, sorted=False)
>>> Metadata(m, sorted=True, words=5000)
Metadata(lower_case=True, sorted=True, words=5000)
update(*args, **kwargs)[source]

Updates this metadata record from the arguments. Arguments may be:

  • other Metadata instances

  • objects that have metadata attribute

  • JSON strings

  • stuff that dict can update from

  • key-value pairs of new or updated metadata fields

static metafilename(filename)[source]

Returns an appropriate metadata filename for the given filename.

>>> Metadata.metafilename("foo.csv")
'foo.csv.meta'
>>> Metadata.metafilename("foo.csv.meta")
'foo.csv.meta'
classmethod load(filename)[source]

Loads a metadata instance from the filename identified by the argument.

Parameters

filename (str) – The name of the metadata file, or of the file to which a sidecar metadata filename exists

save(filename, **kwargs)[source]

Saves the metadata instance to a JSON file.

Parameters
  • filename (str) – Name of the metadata file or the source file

  • **kwargs – are passed on to json.dump()

to_json(**kwargs)[source]

Returns a JSON string containing this metadata object’s contents.

Parameters

**kwargs – Arguments passed to json.dumps()

class delta.util.DocumentDescriber[source]

Bases: object

DocumentDescribers are able to extract metadata from the document IDs of a corpus.

The idea is that a Corpus contains some sort of document name (e.g., original filenames), however, some components would be interested in information inferred from metadata. A DocumentDescriber will be able to produce this information from the document name, be it by inferring it directly (e.g., using some filename policy) or by using an external database.

This base implementation expects filenames of the format “Author_Title.ext” and returns author names as groups and titles as in-group labels.

The DefaultDocumentDescriber adds author and title shortening, and we plan a metadata based TableDocumentDescriber that uses an external metadata table.

group_name(document_name)[source]

Returns the unique name of the group the document belongs to.

The default implementation returns the part of the document name before the first _.

item_name(document_name)[source]

Returns the name of the item within the group.

The default implementation returns the part of the document name after the first _.

group_label(document_name)[source]

Returns a (maybe shortened) label for the group, for display purposes.

The default implementation just returns the group_name().

item_label(document_name)[source]

Returns a (maybe shortened) label for the item within the group, for display purposes.

The default implementation just returns the item_name().

label(document_name)[source]

Returns a label for the document (including its group).

groups(documents)[source]

Returns the names of all groups of the given list of documents.

class delta.util.DefaultDocumentDescriber[source]

Bases: delta.util.DocumentDescriber

group_label(document_name)[source]

Returns just the author’s surname.

item_label(document_name)[source]

Shortens the title to a meaningful but short string.

class delta.util.TableDocumentDescriber(table, group_col, name_col, dialect='excel', **kwargs)[source]

Bases: delta.util.DocumentDescriber

A document describer that takes groups and item labels from an external table.

Parameters
  • table (str or pandas.DataFrame) – A table with metadata that describes the documents of the corpus, either a pandas.DataFrame or a path or IO to a CSV file. The table’s index (or first column for CSV files) contains the document ids that are returned by the FeatureGenerator. The columns (or first row) contain the column labels.

  • group_col (str) – Name of the column in the table that contains the names of the groups. Will be used, e.g., for determining the ground truth for cluster evaluation, and for coloring the dendrograms.

  • name_col (str) – Name of the column in the table that contains the names of the individual items.

  • dialect (str or csv.Dialect) – CSV dialect to use for reading the file.

  • **kwargs – Passed on to pandas.read_table().

Raises

ValueError – when arguments inconsistent

See:

pandas.read_table
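An illustrative setup; metadata.csv, Author, and Title are placeholder names for a metadata table whose index matches the document ids of the corpus:

>>> from delta.util import TableDocumentDescriber
>>> import delta
>>> describer = TableDocumentDescriber('metadata.csv', group_col='Author', name_col='Title')
>>> corpus = delta.Corpus(subdir='texts', document_describer=describer)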

group_name(document_name)[source]

Returns the unique name of the group the document belongs to.

The default implementation returns the part of the document name before the first _.

item_name(document_name)[source]

Returns the name of the item within the group.

The default implementation returns the part of the document name after the first _.

delta.util.ngrams(iterable, n=2, sep=None)[source]

Transforms an iterable into an iterable of ngrams.

Parameters
  • iterable – Input data

  • n (int) – Size of each ngram

  • sep (str) – Separator string for the ngrams

Yields

if sep is None, this yields n-tuples of the iterable. If sep is a string, it is used to join the tuples

Example

>>> list(ngrams('This is a test'.split(), n=2, sep=' '))
['This is', 'is a', 'a test']
delta.util.compare_pairwise(df, comparisons=None)[source]

Builds a table with pairwise comparisons of specific columns in the dataframe df.

This function is intended to provide additional relative metadata to the pairwise distances of a (symmetric) DistanceMatrix. It will take a dataframe and compare its rows pairwise according to the second argument, returning a dataframe in the ‘vector’ form of ssd.squareform().

If your comparisons can be expressed as np.ufuncs, this will be quite efficient.

Parameters
  • df – A dataframe. rows = instances, columns = features.

  • comparisons

    A list of comparison specs. Each spec should be either:

    1. a column name (e.g., a string) for default settings: The absolute difference (np.subtract) for numerical columns, np.equal for everything else

    2. a tuple with 2-4 entries: (source_column, ufunc [, postfunc: callable] [, target_column: str])

      • source column is the name of the column in df to compare

      • ufunc is a two-argument :class:np.ufunc which is pairwise applied to all combinations of the column

      • postfunc is a one-argument function that is applied to the final, 1D result vector

      • target_column is the name of the column in the result dataframe (if missing, source column will be used)

    If comparisons is missing, a default comparison will be created for every column

Returns

A dataframe. Will have a column for each comparison spec and a row for each unique pair in the index. The order of rows will be similar to [(i, j) for i in 0..(n-1) for j in (i+1)..(n-1)].

Example

>>> df = pd.DataFrame({'Class': ['a', 'a', 'b'], 'Size': [42, 30, 5]})
>>> compare_pairwise(df)
     Class  Size
0 1   True    12
  2  False    37
1 2  False    25
>>> compare_pairwise(df, ['Class', ('Size', np.subtract, np.absolute, 'Size_Diff'), ('Size', np.add, 'Size_Total')])
     Class  Size_Diff  Size_Total
0 1   True         12          72
  2  False         37          47
1 2  False         25          35

Module contents

pydelta library

Stylometrics in Python
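A hedged end-to-end sketch combining the pieces documented below; 'texts' is a placeholder directory, and the cosine delta is instantiated explicitly via MetricDeltaFunction so the example does not depend on which delta functions are pre-registered in your installation:

>>> import delta
>>> from delta.cluster import Clustering
>>> corpus = delta.Corpus(subdir='texts')                    # raw word counts
>>> mfw = corpus.get_mfw_table(2000)                         # 2000 MFW, relative frequencies
>>> cosine = delta.MetricDeltaFunction('cosine', 'cosine_sketch', register=False)
>>> distances = cosine(mfw)                                  # DistanceMatrix
>>> scores = Clustering(distances).fclustering().evaluate()  # cluster and evaluate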

class delta.Corpus(source=None, *, subdir=None, file=None, corpus=None, feature_generator=None, document_describer=<delta.util.DefaultDocumentDescriber object>, metadata=None, **kwargs)[source]

Bases: pandas.core.frame.DataFrame

Creates a new Corpus.

You can create a corpus either from a filesystem subdir with raw text files, or from a CSV file with a document-term matrix, or from another corpus or dataframe that contains (potentially preprocessed) document/term vectors. Either option may be passed via appropriately named keyword argument or as the only positional argument, but exactly one must be present.

If you pass a subdirectory, Corpus will call a FeatureGenerator to read and parse the files and to generate a default word count. The default implementation will search for plain text files *.txt inside the directory and parse them using a simple regular expression. It has a few options, e.g., glob and lower_case, that can also be passed directly to Corpus as keyword arguments. E.g., Corpus('texts', glob='plain-*.txt', lower_case=True) will look for files called plain-xxx.txt and convert them to lower case before tokenizing. See FeatureGenerator for more details.

The document_describer can contain per-document metadata which can be used, e.g., as ground truth.

The metadata record contains global metadata (e.g., which transformations have already been performed); it will be inherited from a corpus argument, and all additional keyword arguments will be included in this record.

Parameters

source – Positional variant of either subdir, file, or corpus

Keyword Arguments
  • subdir (str) – Path to a subdirectory containing the (unprocessed) corpus data.

  • file (str) – Path to a CSV file containing the feature vectors.

  • corpus (pandas.DataFrame) – A dataframe or Corpus from which to create a new corpus, as a copy.

  • feature_generator (FeatureGenerator) – A customizable helper class that will process a subdir to a feature matrix, if the subdir argument is also given. If None, a default feature generator will be used.

  • metadata (dict) – A dictionary with metadata to copy into the new corpus.

  • **kwargs – Additionally, if feature_generator is None and subdir is not None, you can pass FeatureGenerator arguments and they will be used when instantiating the feature generator. Additional keyword arguments will be set in the metadata record of the new corpus.

Warning

You should either use a single positional argument (source) or one of subdir, file, or corpus as keyword arguments. In future versions, source will be positional-only.

new_data(data, **metadata)[source]

Wraps the given DataFrame with metadata from this corpus object.

Parameters
  • data (pandas.DataFrame) – Raw data that is derived by, e.g., pandas filter operations

  • **metadata – Metadata fields that should be changed / modified

save(filename='corpus_words.csv')[source]

Saves the corpus to a CSV file.

The corpus will be saved to a CSV file containing documents in the columns and features in the rows, i.e. a transposed representation. Document and feature labels will be saved to the first row or column, respectively.

A metadata file will be saved alongside the file.

Parameters

filename (str) – The target file.

is_absolute() → bool[source]
Returns

True if this is a corpus using absolute frequencies

Return type

bool

is_complete() → bool[source]

A corpus is complete as long as it contains the absolute frequencies of all features of all documents. Many operations like calculating the relative frequencies require a complete corpus. Once a corpus has lost its completeness, it is not possible to restore it.

get_mfw_table(mfwords)[source]

Shortens the list to the given number of most frequent words and converts the word counts to relative frequencies.

This returns a new Corpus, the data in this object is not modified.

Parameters

mfwords (int) – number of most frequent words in the new corpus. 0 means all words.

Returns

a new sorted corpus shortened to mfwords

Return type

Corpus

top_n(mfwords)[source]

Returns a new Corpus that contains the top n features.

Parameters

mfwords (int) – Number of most frequent items in the new corpus.

Returns

a new corpus shortened to mfwords

Return type

Corpus

save_wordlist(filename, **kwargs)[source]

Saves the current word list to a text file.

Parameters
  • filename (str) – Path to the file to save

  • kwargs – Additional arguments to pass to open()

filter_wordlist(filename, **kwargs)[source]

Returns a new corpus that contains the features from the given file.

This method will read the list of words from the given file and then return a new corpus that uses the features listed in the file, in the order they are in the file.

Parameters

filename (str) – Path to the file to load. Each line contains one feature. Leading and trailing whitespace, lines starting with #, and empty lines are ignored.

Returns

New corpus with selected features.

filter_features(features, **metadata)[source]

Returns a new corpus that contains only the given features.

Parameters

features (Iterable) – The features to select. If they are stored in a file, use filter_wordlist() instead.

relative_frequencies()[source]
z_scores()[source]
cull(ratio=None, threshold=None, keepna=False)[source]

Removes all features that do not appear in a minimum number of documents.

Parameters
  • ratio (float) – Minimum ratio of documents a word must occur in to be retained. Note that we’re always rounding towards the ceiling, i.e. if the corpus contains 10 documents and ratio=1/3, a word must occur in at least 4 documents (if this is >= 1, it is interpreted as threshold)

  • threshold (int) – Minimum number of documents a word must occur in to be retained

  • keepna (bool) – If set to True, the missing words in the returned corpus will be retained as nan instead of 0.

Returns

A new corpus with the culled words removed. The original corpus is left unchanged.

Return type

Corpus

reparse(feature_generator, subdir=None, **kwargs)[source]

Parse or re-parse a set of documents with different settings.

This runs the given feature generator on the given or configured subdirectory. The feature vectors returned by the feature generator will replace or augment the corpus.

Parameters
  • feature_generator (FeatureGenerator) – Will be used for extracting the features.

  • subdir (str) – If given, will be passed to the feature generator for processing. Otherwise, we’ll use the subdir configured with this corpus.

  • **kwargs – Additional metadata for the returned corpus.

Returns

a new corpus with the respective columns replaced or added.

The current object will be left unchanged.

Return type

Corpus

Raises

CorpusNotAbsolute – if called on a corpus with relative frequencies

tokens() → pandas.core.series.Series[source]

Number of tokens by text

types() → pandas.core.series.Series[source]

Number of different features by text

ttr() → float[source]

Type/token ratio for the whole corpus.

ttr_by_text() → pandas.core.series.Series[source]

Type/token ratio for each text.

class delta.FeatureGenerator(lower_case: bool = False, encoding: str = 'utf-8', glob: str = '*.txt', skip: Optional[str] = None, token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0), max_tokens: Optional[int] = None, ngrams: Optional[int] = None, parallel: Union[int, bool, joblib.parallel.Parallel] = False, sort: str = 'documents', sparse: bool = False)[source]

Bases: object

A feature generator is responsible for converting a subdirectory of files into a feature matrix (that will then become a corpus). If you need to customize the feature extraction process, create a custom feature generator and pass it into your Corpus constructor call along with its subdir argument.

The default feature generator is able to process a directory of text files, tokenize each of the text files according to a regular expression, and count each token type for each file. To customize feature extraction, you have two options:

  1. for simple customizations, just create a new FeatureGenerator and set the constructor arguments accordingly. Look in the docstring for __init__() for details.

  2. in more complex cases, create a subclass and override methods as you see fit.

On a feature generator passed in to Corpus, only two methods will be called:

  • __call__(), i.e. the object as a callable, to actually generate

    the feature vector,

  • metadata to obtain metadata fields that will be included in

    the corresponding corpus.

So, if you wish to write a completely new feature generator, you can ignore the other methods.

Parameters
  • lower_case (bool) – if True, normalize all tokens to lower case before counting them

  • encoding (str) – the encoding to use when reading files

  • glob (str) – the pattern inside the subdirectory to find files.

  • skip (str) – don’t handle files that match this pattern

  • token_pattern (re.Regex) – The regular expression used to identify tokens. The default, LETTERS_PATTERN, will simply find sequences of unicode letters. WORD_PATTERN will find the shortest sequence of letters and apostrophes between two word boundaries (according to the simple word-boundary algorithm from Unicode regular expressions) that contains at least one letter.

  • max_tokens (int) – If set, stop reading each file after that many words.

  • ngrams (int) – Count token ngrams instead of single tokens

  • parallel (bool, int, Parallel) – If truthy, read and parse files in parallel. The actual argument may be None or False for no special processing, an int for the required number of jobs, or a dictionary with Parallel arguments for finer control.

  • sort (str) – Sort the final feature matrix by index before returning. Possible values: documents or index (sort by document names), features or columns (sort by feature labels, i.e. words), both (sort along both axes), None or the empty string (do not sort).

  • sparse (bool) – build a sparse dataframe. Requires Pandas >=1.0

lower_case: bool = False
encoding: str = 'utf-8'
glob: str = '*.txt'
skip: Optional[str] = None
token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0)
max_tokens: Optional[int] = None
ngrams: Optional[int] = None
parallel: Union[int, bool, joblib.parallel.Parallel] = False
sort: str = 'documents'
sparse: bool = False
logger = <Logger delta.corpus.FeatureGenerator (WARNING)>
tokenize(lines)[source]

Tokenizes the given lines.

This method is called by count_tokens(). The default implementation will return an iterable of all tokens in the given lines that match the token_pattern. The result of this method can further be postprocessed by postprocess_tokens().

Parameters

lines – Iterable of strings in which to look for tokens.

Returns

Iterable (default implementation generator) of tokens

postprocess_tokens(tokens)[source]

Postprocesses the tokens according to the options provided when creating the feature generator.

Currently respects lower_case and ngrams. This is called by count_tokens after tokenizing.

Parameters

tokens – iterable of tokens as returned by tokenize()

Returns

iterable of postprocessed tokens

count_tokens(lines)[source]

This calls tokenize() to split the iterable lines into tokens. If the lower_case attribute is set, the tokens are then converted to lower case. The tokens are counted, and the method returns a pandas.Series mapping each token to its number of occurrences.

This is called by process_file().

Parameters

lines – Iterable of strings in which to look for tokens.

Returns

maps tokens to the number of occurrences.

Return type

pandas.Series

get_name(filename)[source]

Converts a single file name to a label for the corresponding feature vector.

Returns

Feature vector label (filename w/o extension by default)

Return type

str

process_file(filename)[source]

Processes a single file to a feature vector.

The default implementation reads the file pointed to by filename as a text file, calls count_tokens() to create token counts and get_name() to calculate the label for the feature vector.

Parameters

filename (str) – The path to the file to process

Returns

Feature counts, its name set according to get_name()

Return type

pd.Series

process_directory(directory)[source]

Iterates through the given directory and runs process_file() for each file matching glob in there.

Parameters

directory (str) – Path to the directory to process

Returns

mapping name to pandas.Series

Return type

dict

property metadata

Returns: Metadata: metadata record that describes the parameters of the features used for corpora created using this feature generator.

class delta.Normalization(f, name=None, title=None, register=True)[source]

Bases: object

Wrapper for normalizations.

delta.normalization(*args, **kwargs)[source]

Decorator that creates a Normalization from a function or (callable) object. Can be used without or with keyword arguments:

  • name (str) – Name (identifier) for the normalization. By default, the function’s name is used.

  • title (str) – Human-readable title for the normalization.

class delta.DeltaFunction(f=None, descriptor=None, name=None, title=None, register=True)[source]

Bases: object

Abstract base class of a delta function.

To define a delta function, you have various options:

  1. subclass DeltaFunction and override its __call__() method with something that directly handles a Corpus.

  2. subclass DeltaFunction and override its distance() method with a distance function

  3. instantiate DeltaFunction and pass it a distance function, or use the delta() decorator

  4. use one of the subclasses

Creates a custom delta function.

Parameters
  • f (function) – a distance function that calculates the difference between two feature vectors and returns a float. If passed, this will be used for the implementation.

  • name (str) – The name/id of this function. Can be inferred from f or descriptor.

  • descriptor (str) – The descriptor to identify this function.

  • title (str) – A human-readable title for this function.

  • register (bool) – If true (default), register this delta function with the function registry on instantiation.

static distance(u, v, *args, **kwargs)[source]

Calculate a distance between two feature vectors.

This is an abstract method, you must either inherit from DeltaFunction and override distance or assign a function in order to use this.

Parameters
  • u (pandas.Series) – The documents to compare.

  • v (pandas.Series) – The documents to compare.

  • *args – Passed through from the caller

  • **kwargs

    Passed through from the caller

Returns

Distance between the documents.

Return type

float

Raises

NotImplementedError if no implementation is provided.

register()[source]

Registers this delta function with the global function registry.

iterate_distance(corpus, *args, **kwargs)[source]

Calculates the distance matrix for the given corpus.

The default implementation will iterate over all pairwise combinations of the documents in the given corpus and call distance() on each pair, passing on the additional arguments.

Clients may want to use __call__() instead, i.e. they want to call this object as a function.

Parameters
  • corpus (Corpus) – feature matrix for which to calculate the distance

  • *args – additional arguments, passed on to distance()

  • **kwargs – additional keyword arguments, passed on to distance()

Returns

square dataframe containing pairwise distances.

The default implementation returns a matrix with zeros on the diagonal whose lower triangle mirrors the upper triangle.

Return type

pandas.DataFrame
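
Conceptually, the default implementation behaves like the following sketch (the helper name is hypothetical; it assumes the corpus behaves like a pandas DataFrame with one row per document):

import itertools
import pandas as pd

def iterate_distance_sketch(delta_function, corpus):
    index = corpus.index
    result = pd.DataFrame(0.0, index=index, columns=index)    # zeros on the diagonal
    for a, b in itertools.combinations(index, 2):
        d = delta_function.distance(corpus.loc[a], corpus.loc[b])
        result.at[a, b] = result.at[b, a] = d                  # mirror into the lower triangle
    return result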

create_result(df, corpus)[source]

Wraps a square dataframe to a DistanceMatrix, adding appropriate metadata from corpus and this delta function.

Parameters
  • df (pandas.DataFrame) – square dataframe of pairwise distances

  • corpus (Corpus) – the corpus from which the distances have been calculated

Returns

df as values, with appropriate metadata attached

Return type

DistanceMatrix

prepare(corpus)[source]

Return the corpus prepared for the metric, if applicable.

Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.

If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.

The default implementation simply returns the corpus as-is.

Raises

NotImplementedError – if there is no metric

class delta.PDistDeltaFunction(metric, name=None, title=None, register=True, scale=False, **kwargs)[source]

Bases: delta.deltas.DeltaFunction

Wraps one of the metrics implemented by ssd.pdist() as a delta function.

Warning

You should use MetricDeltaFunction instead.

Parameters
  • metric (str) – The metric that should be called via ssd.pdist

  • name (str) – Name / Descriptor for the delta function, if None, metric is used

  • title (str) – Human-Readable Title

  • register (bool) – If false, don’t register this with the registry

  • **kwargs – passed on to ssd.pdist()

class delta.MetricDeltaFunction(metric, name=None, title=None, register=True, scale=False, fix_symmetry=True, **kwargs)[source]

Bases: delta.deltas.DeltaFunction

Distance functions based on scikit-learn’s sklearn.metrics.pairwise_distances().

Parameters
  • metric (str) – The metric that should be called via sklearn.metrics.pairwise_distances()

  • name (str) – Name / Descriptor for the delta function, if None, metric is used

  • title (str) – Human-Readable Title

  • register (bool) – If false, don’t register this with the registry

  • scale (bool) – Scale by number of features

  • fix_symmetry (bool) – Force the resulting matrix to be symmetric

  • **kwargs – passed on to sklearn.metrics.pairwise_distances()

Note

sklearn.metrics.pairwise_distances() is fast, but the result may not be exactly symmetric. The fix_symmetry option enforces symmetry by mirroring the lower-left triangle after the distances have been calculated, so that, e.g., scipy’s clustering functions won’t complain.
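
Example (a hedged sketch; the explicit name is chosen to avoid clashing with any metric that may already be registered):

from delta import MetricDeltaFunction

cosine_demo = MetricDeltaFunction("cosine", name="cosine_demo",
                                  title="Cosine distance (demo)")
# distances = cosine_demo(corpus)   # corpus: a delta.Corpus instance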

class delta.CompositeDeltaFunction(descriptor, name=None, title=None, register=True)[source]

Bases: delta.deltas.DeltaFunction

A composite delta function consists of a basis (which is another delta function) and a list of normalizations. It first transforms the corpus via all the given normalizations in order, and then runs the basis on the result.

Creates a new composite delta function.

Parameters
  • descriptor (str) – Formally defines this delta function: first the name of an existing, registered distance function, then, separated by “-”, the names of the normalizations to run, in order.

  • name (str) – Name by which this delta function is registered, in addition to the descriptor

  • title (str) – human-readable title

  • register (bool) – If true (the default), register this delta function on creation
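
Example (a hedged sketch; “manhattan” and “z_score” are assumed to be registered names of a basis and a normalization — adjust them to what is actually registered):

from delta import CompositeDeltaFunction

burrows_demo = CompositeDeltaFunction("manhattan-z_score", name="burrows_demo",
                                      title="Burrows-style Delta (demo)")
# distances = burrows_demo(corpus)  # z-scores the corpus, then applies Manhattan distance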

prepare(corpus)[source]

Return the corpus prepared for the metric, if applicable.

Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.

If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.

The default implementation simply returns the corpus as-is.

Raises

NotImplementedError – if there is no metric

class delta.Clustering(distance_matrix, method='ward', **kwargs)[source]

Bases: object

Represents a hierarchical clustering.

Note

This is subject to refactoring once we implement more clustering methods

fclustering()[source]

Returns a default flat clustering from the hierarchical version.

This method uses the DocumentDescriber to determine the groups, and uses the number of groups as a maxclust criterion.

Returns

A properly initialized representation of the flat clustering.

Return type

FlatClustering
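
Example (a hedged sketch; distances is assumed to be a DistanceMatrix produced by one of the delta functions above):

from delta import Clustering

hclust = Clustering(distances)   # hierarchical clustering, Ward linkage by default
flat = hclust.fclustering()      # flatten: at most as many clusters as ground-truth groups
print(flat.describe())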

class delta.FlatClustering(distances, clusters=None, metadata=None, **kwargs)[source]

Bases: object

A flat clustering represents a non-hierarchical clustering.

Notes

FlatClustering uses a data frame field called data to store the actual clustering. This field will have the same index as the distance matrix, and three columns labeled Group, GroupID, and Cluster. Group will be the group label returned by the DocumentDescriber we use, GroupID a numerical ID for each group (to be used as ground truth), and Cluster the numerical ID of the actual cluster assigned by the clustering algorithm.

As long as FlatClustering’s initialized property is False, the clustering has not been assigned yet.

set_clusters(clusters)[source]
static ngroups(df)[source]

With df being a data frame that has a Group column, return the number of different authors in df.

cluster_errors()[source]

Calculates the number of cluster errors by:

  1. calculating the total number of different authors in the set

  2. calling sch.fcluster to generate at most that many flat clusters

  3. for each of those clusters, the cluster errors are the number of different authors in that cluster minus 1

  4. the result is the sum of the errors over all clusters

purity()[source]

To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N, the total number of documents.
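
In symbols (the standard definition, not a library-specific formula), with clusters Ω = {ω_1, …, ω_K}, ground-truth classes C = {c_1, …, c_J}, and N documents:

purity(Ω, C) = (1/N) · Σ_k max_j |ω_k ∩ c_j|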

entropy()[source]

Smaller entropy values suggest a better clustering.

adjusted_rand_index()[source]

Calculates the Adjusted Rand Index for the given flat clustering. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score

homogeneity_completeness_v_measure()[source]
evaluate()[source]
Returns

All scores for the current clustering

Return type

pandas.Series

clusters(labeled=False)[source]

Documents by cluster.

Parameters

labeled (bool) – If True, represent each document by its label as calculated by the DocumentDescriber. This is typically a human-readable, shortened description

Returns

Maps each cluster number to a list of documents.

Return type

dict

describe()[source]

Returns a description of the current flat clustering.
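
Example (continuing the hedged Clustering sketch above):

scores = flat.evaluate()            # pandas.Series with all evaluation scores
print(scores)
for cluster_id, documents in flat.clusters(labeled=True).items():
    print(cluster_id, documents)    # human-readable labels per cluster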

delta.get_rfe_features(corpus, estimator=None, steps=[(10000, 1000), (1000, 200), (500, 25)], cv=True)[source]
Parameters
  • corpus – a Corpus (providing a document_describer) to select features from

  • estimator – supervised learning estimator used for the recursive feature elimination

  • steps – list of tuples (features_to_select, step)

  • cv – whether to perform an additional cross-validated selection step

Returns

set of selected terms.

Return type

rfe_terms
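
Example (a hedged sketch; the choice of LinearSVC is an assumption for illustration — any supervised estimator exposing feature weights should do):

from sklearn.svm import LinearSVC
from delta import get_rfe_features

terms = get_rfe_features(corpus, estimator=LinearSVC())   # corpus: a delta.Corpus
# 'terms' holds the selected feature names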

class delta.Dendrogram(clustering, describer=None, ax=None, orientation='left', font_size=None, link_color='k', title='Corpus: {corpus}', xlabel='Delta: {delta_title}, {words} most frequent {features}')[source]

Bases: object

Creates a dendrogram representation from a hierarchical clustering.

This is a wrapper around, and an improvement to, sch.dendrogram(), tailored for use in pydelta.

Parameters
  • clustering (Clustering) – A hierarchical clustering.

  • describer (DocumentDescriber) – Document describer used for determining the groups and the labels for the documents used (optional). By default, the document describer inherited from the clustering is used.

  • ax (mpl.axes.Axes) – Axes object to draw on. Uses pyplot default axes if not provided.

  • orientation (str) – Orientation of the dendrogram. Currently, only “left” (the default) is supported.

  • font_size – Font size for the label, in points. If not provided, sch.dendrogram() calculates a default.

  • link_color (str) – The color used for the links in the dendrogram, by default k (for black).

  • title (str) – a title that will be printed on the plot. The string may be a template string as supported by str.format_map() with metadata field names in curly braces, it will be evaluated against the clustering’s metadata. If you pass None here, no title will be added.

Notes

The dendrogram will be painted by matplotlib / pyplot using the default styles, which means you can use, e.g., seaborn to influence the overall design of the image.

Dendrogram handles coloring differently than sch.dendrogram(): it colors the document labels according to the pre-assigned grouping (e.g., by author). To do so, it builds on matplotlib’s default color_cycle and rotates through it, so if you need more colors, adjust the color_cycle accordingly.

show()[source]
save(fname, **kwargs)[source]
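
Example (a hedged sketch; hclust is assumed to be a Clustering instance as in the earlier sketch):

from delta import Dendrogram

dendro = Dendrogram(hclust, font_size=8)
dendro.save("dendrogram.pdf")      # or dendro.show() for interactive display
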
delta.compare_pairwise(df, comparisons=None)[source]

Builds a table with pairwise comparisons of specific columns in the dataframe df.

This function is intended to provide additional relative metadata to the pairwise distances of a (symmetric) DistanceMatrix. It will take a dataframe and compare its rows pairwise according to the second argument, returning a dataframe in the ‘vector’ form of ssd.squareform().

If your comparisons can be expressed as np.ufuncs, this will be quite efficient.

Parameters
  • df – A dataframe. rows = instances, columns = features.

  • comparisons

    A list of comparison specs. Each spec should be either:

    1. a column name (e.g., a string) for default settings: The absolute difference (np.subtract) for numerical columns, np.equal for everything else

    2. a tuple with 2-4 entries: (source_column, ufunc [, postfunc: callable] [, target_column: str])

      • source column is the name of the column in df to compare

      • ufunc is a two-argument np.ufunc that is applied pairwise to all combinations of values from the column

      • postfunc is a one-argument function that is applied to the final, 1D result vector

      • target_column is the name of the column in the result dataframe (if missing, source column will be used)

    If comparisons is missing, a default comparison will be created for every column

Returns

A dataframe. Will have a column for each comparison spec and a row for each unique pair in the index. The order of rows will be similar to [(i, j) for i in 0..(n-1) for j in (i+1)..(n-1)].

Example

>>> df = pd.DataFrame({'Class': ['a', 'a', 'b'], 'Size': [42, 30, 5]})
>>> compare_pairwise(df)
     Class  Size
0 1   True    12
  2  False    37
1 2  False    25
>>> compare_pairwise(df, ['Class', ('Size', np.subtract, np.absolute, 'Size_Diff'), ('Size', np.add, 'Size_Total')])
     Class  Size_Diff  Size_Total
0 1   True         12          72
  2  False         37          47
1 2  False         25          35
class delta.Metadata(*args, **kwargs)[source]

Bases: collections.abc.Mapping

A metadata record contains information about how a particular object of the pyDelta universe has been constructed, or how it will be manipulated.

Metadata fields are simply attributes, and they can be used as such.

Create a new metadata instance. Arguments will be passed on to update().

Examples

>>> m = Metadata(lower_case=True, sorted=False)
>>> Metadata(m, sorted=True, words=5000)
Metadata(lower_case=True, sorted=True, words=5000)
update(*args, **kwargs)[source]

Updates this metadata record from the arguments. Arguments may be:

  • other Metadata instances

  • objects that have metadata attribute

  • JSON strings

  • stuff that dict can update from

  • key-value pairs of new or updated metadata fields

static metafilename(filename)[source]

Returns an appropriate metadata filename for the given filename.

>>> Metadata.metafilename("foo.csv")
'foo.csv.meta'
>>> Metadata.metafilename("foo.csv.meta")
'foo.csv.meta'
classmethod load(filename)[source]

Loads a metadata instance from the filename identified by the argument.

Parameters

filename (str) – The name of the metadata file, or of a file for which a sidecar metadata file exists

save(filename, **kwargs)[source]

Saves the metadata instance to a JSON file.

Parameters
  • filename (str) – Name of the metadata file or the source file

  • **kwargs – are passed on to json.dump()

to_json(**kwargs)[source]

Returns a JSON string containing this metadata object’s contents.

Parameters

**kwargs – Arguments passed to json.dumps()
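
Example (a hedged sketch; it assumes save() uses the same sidecar naming as metafilename(), as the load() documentation suggests):

from delta import Metadata

m = Metadata(lower_case=True, words=2000)
m.save("corpus.csv")               # writes the sidecar file corpus.csv.meta
m2 = Metadata.load("corpus.csv")   # reads corpus.csv.meta back in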

class delta.TableDocumentDescriber(table, group_col, name_col, dialect='excel', **kwargs)[source]

Bases: delta.util.DocumentDescriber

A document describer that takes groups and item labels from an external table.

Parameters
  • table (str or pandas.DataFrame) – A table with metadata that describes the documents of the corpus, either a pandas.DataFrame or a path or IO to a CSV file. The table’s index (or, for CSV files, its first column) contains the document ids that are returned by the FeatureGenerator. The columns (or, for CSV files, the first row) contain the column labels.

  • group_col (str) – Name of the column in the table that contains the names of the groups. Will be used, e.g., for determining the ground truth for cluster evaluation, and for coloring the dendrograms.

  • name_col (str) – Name of the column in the table that contains the names of the individual items.

  • dialect (str or csv.Dialect) – CSV dialect to use for reading the file.

  • **kwargs – Passed on to pandas.read_table().

Raises

ValueError – when the arguments are inconsistent

See:

pandas.read_table

group_name(document_name)[source]

Returns the unique name of the group the document belongs to.

The default implementation returns the part of the document name before the first _.

item_name(document_name)[source]

Returns the name of the item within the group.

The default implementation returns the part of the document name after the first _.
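
Example (a hedged sketch; the file name and the “Author”/“Title” column names are illustrative assumptions about your metadata table):

from delta import TableDocumentDescriber

describer = TableDocumentDescriber("metadata.csv",
                                   group_col="Author", name_col="Title")
# describer.group_name("some_document_id")  -> value of the "Author" column for that document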

class delta.KMedoidsClustering(corpus, delta, n_clusters=None, extra_args={}, metadata=None, **kwargs)[source]

Bases: delta.cluster.FlatClustering