delta package¶
Submodules¶
delta.cluster module¶
Clustering of distance matrices.
Clustering represents a hierarchical clustering which can be flattened using Clustering.fcluster(); the flattened clustering is then represented by FlatClustering.
If supported by the installed version of scikit-learn, there is also a KMedoidsClustering.
- class delta.cluster.Clustering(distance_matrix, method='ward', **kwargs)[source]¶
Bases:
object
Represents a hierarchical clustering.
Note
This is subject to refactoring once we implement more clustering methods.
- class delta.cluster.FlatClustering(distances, clusters=None, metadata=None, **kwargs)[source]¶
Bases:
object
A flat clustering represents a non-hierarchical clustering.
Notes
FlatClustering uses a data frame field called data to store the actual clustering. This field will have the same index as the distance matrix, and three columns labeled Group, GroupID, and Cluster. Group will be the group label returned by the DocumentDescriber we use, GroupID a numerical ID for each group (to be used as ground truth), and Cluster the numerical ID of the actual cluster assigned by the clustering algorithm.
As long as FlatClustering's initialized property is False, the clustering is not assigned yet.
- static ngroups(df)[source]¶
With df being a data frame that has a Group column, return the number of different authors in df.
- cluster_errors()[source]¶
Calculates the number of cluster errors by:
calculating the total number of different authors in the set
calling sch.fcluster to generate at most that many flat clusters
for each of those clusters, the cluster errors are the number of authors in this cluster - 1
the result is the sum of all clusters' errors
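The steps above can be sketched in plain Python. The function and argument names are illustrative, not pydelta's internals; the real method works on the clustering's data frame and calls sch.fcluster itself.

```python
from collections import defaultdict

# `groups` are ground-truth author labels, `clusters` flat cluster IDs
# (hypothetical inputs standing in for the clustering's data frame columns).
def cluster_errors(groups, clusters):
    authors_by_cluster = defaultdict(set)
    for author, cluster in zip(groups, clusters):
        authors_by_cluster[cluster].add(author)
    # each cluster contributes (number of distinct authors in it - 1) errors
    return sum(len(authors) - 1 for authors in authors_by_cluster.values())

# cluster 0 mixes authors A and B (1 error); cluster 1 contains only C (0 errors)
print(cluster_errors(["A", "B", "C", "C"], [0, 0, 1, 1]))  # 1
```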
- purity()[source]¶
To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measured by counting the number of correctly assigned documents and dividing by $N$.
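A minimal sketch of this computation on plain lists of true classes and predicted cluster IDs (illustrative only, not pydelta's actual implementation):

```python
from collections import Counter, defaultdict

def purity(classes, clusters):
    # group the true class labels by predicted cluster
    members = defaultdict(list)
    for cls, cluster in zip(classes, clusters):
        members[cluster].append(cls)
    # each cluster is assigned its most frequent class; count correct documents
    correct = sum(Counter(group).most_common(1)[0][1] for group in members.values())
    return correct / len(classes)

# cluster 0 = [A, A, B] -> majority A, 2 correct; cluster 1 = [B] -> 1 correct
print(purity(["A", "A", "B", "B"], [0, 0, 0, 1]))  # 0.75
```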
- adjusted_rand_index()[source]¶
Calculates the Adjusted Rand Index for the given flat clustering. See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
- class delta.cluster.KMedoidsClustering_distances(distances, n_clusters=None, metadata=None, **kwargs)[source]¶
Bases:
delta.cluster.FlatClustering
- class delta.cluster.KMedoidsClustering(corpus, delta, n_clusters=None, extra_args={}, metadata=None, **kwargs)[source]¶
Bases:
delta.cluster.FlatClustering
delta.corpus module¶
The delta.corpus module contains code for building, loading, saving, and
manipulating the representation of a corpus. Its heart is the Corpus
class which represents the feature matrix. Also contained are default
implementations for reading and tokenizing files and creating a feature vector
out of that.
- class delta.corpus.FeatureGenerator(lower_case: bool = False, encoding: str = 'utf-8', glob: str = '*.txt', skip: Optional[str] = None, token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0), max_tokens: Optional[int] = None, ngrams: Optional[int] = None, parallel: Union[int, bool, joblib.parallel.Parallel] = False, sort: str = 'documents', sparse: bool = False)[source]¶
Bases:
object
A feature generator is responsible for converting a subdirectory of files into a feature matrix (that will then become a corpus). If you need to customize the feature extraction process, create a custom feature generator and pass it into your Corpus constructor call along with its subdir argument.
The default feature generator is able to process a directory of text files, tokenize each of the text files according to a regular expression, and count each token type for each file. To customize feature extraction, you have two options:
for simple customizations, just create a new FeatureGenerator and set the constructor arguments accordingly. Look in the docstring for __init__() for details.
in more complex cases, create a subclass and override methods as you see fit.
On a feature generator passed in to Corpus, only two methods will be called: __call__(), i.e. the object as a callable, to actually generate the feature vector, and metadata to obtain metadata fields that will be included in the corresponding corpus.
So, if you wish to write a completely new feature generator, you can ignore the other methods.
- Parameters
lower_case (bool) – if True, normalize all tokens to lower case before counting them
encoding (str) – the encoding to use when reading files
glob (str) – the pattern inside the subdirectory to find files.
skip (str) – don’t handle files that match this pattern
token_pattern (re.Regex) – The regular expression used to identify tokens. The default, LETTERS_PATTERN, will simply find sequences of unicode letters. WORD_PATTERN will find the shortest sequence of letters and apostrophes between two word boundaries (according to the simple word-boundary algorithm from Unicode regular expressions) that contains at least one letter.
max_tokens (int) – If set, stop reading each file after that many words.
ngrams (int) – Count token ngrams instead of single tokens
parallel (bool, int, Parallel) – If truthy, read and parse files in parallel. The actual argument may be: None or False for no special processing, an int for the required number of jobs, or a dictionary with Parallel arguments for finer control.
sort (str) – Sort the final feature matrix by index before returning. Possible values: documents or index (sort by document names); features or columns (sort by feature labels, i.e. words); both (sort along both axes); None or the empty string (do not sort).
sparse (bool) – build a sparse dataframe. Requires Pandas >= 1.0
- token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0)¶
- logger = <Logger delta.corpus.FeatureGenerator (WARNING)>¶
- tokenize(lines)[source]¶
Tokenizes the given lines.
This method is called by count_tokens(). The default implementation will return an iterable of all tokens in the given lines that match the token_pattern. The result of this method can further be postprocessed by postprocess_tokens().
- Parameters
lines – Iterable of strings in which to look for tokens.
- Returns
Iterable (default implementation generator) of tokens
- postprocess_tokens(tokens)[source]¶
Postprocesses the tokens according to the options provided when creating the feature generator.
Currently respects lower_case and ngrams. This is called by count_tokens() after tokenizing.
- Parameters
tokens – iterable of tokens as returned by tokenize()
- Returns
iterable of postprocessed tokens
- count_tokens(lines)[source]¶
This calls tokenize() to split the iterable lines into tokens. If the lower_case attribute is set, the tokens are then converted to lower case. The tokens are counted, and the method returns a pd.Series mapping each token to its number of occurrences.
This is called by process_file().
- Parameters
lines – Iterable of strings in which to look for tokens.
- Returns
maps tokens to the number of occurrences.
- Return type
pd.Series
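The tokenize-and-count pipeline can be sketched with the standard library. Note this is a hand-rolled illustration: the real implementation uses the regex module's \p{L}+ pattern and returns a pd.Series, while a Counter shows the same token-to-count mapping.

```python
import re
from collections import Counter

def count_tokens(lines, lower_case=False):
    # [^\W\d_]+ approximates \p{L}+ (sequences of Unicode letters)
    pattern = re.compile(r"[^\W\d_]+", re.UNICODE)
    counts = Counter()
    for line in lines:
        for token in pattern.findall(line):
            counts[token.lower() if lower_case else token] += 1
    return counts

counts = count_tokens(["To be or not to be"], lower_case=True)
print(counts["to"], counts["be"], counts["or"])  # 2 2 1
```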
- get_name(filename)[source]¶
Converts a single file name to a label for the corresponding feature vector.
- Returns
Feature vector label (filename w/o extension by default)
- Return type
str
- process_file(filename)[source]¶
Processes a single file to a feature vector.
The default implementation reads the file pointed to by filename as a text file, calls count_tokens() to create token counts and get_name() to calculate the label for the feature vector.
- Parameters
filename (str) – The path to the file to process
- Returns
- Feature counts, its name set according to get_name()
- Return type
pd.Series
- process_directory(directory)[source]¶
Iterates through the given directory and runs process_file() for each file matching glob in there.
- property metadata¶
Returns: Metadata: metadata record that describes the parameters of the
features used for corpora created using this feature generator.
- class delta.corpus.SimpleFeatureGenerator(lower_case: bool = False, encoding: str = 'utf-8', glob: str = '*.txt', skip: Optional[str] = None, token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0), max_tokens: Optional[int] = None, ngrams: Optional[int] = None, parallel: Union[int, bool, joblib.parallel.Parallel] = False, sort: str = 'documents', sparse: bool = False)[source]¶
Bases:
delta.corpus.FeatureGenerator
A simplified, faster version of the FeatureGenerator.
With respect to feature generation the behaviour is the same as with FeatureGenerator, but it is slightly less flexible with respect to subclassing. It does not read the files linewise, and it never creates pd.Series().
- postprocess_tokens(tokens)[source]¶
Postprocesses the tokens according to the options provided when creating the feature generator.
Currently respects lower_case and ngrams. This is called by count_tokens() after tokenizing.
- Parameters
tokens – iterable of tokens as returned by tokenize()
- Returns
iterable of postprocessed tokens
- process_file(filename)[source]¶
Processes a single file to a feature vector.
The default implementation reads the file pointed to by filename as a text file, calls count_tokens() to create token counts and get_name() to calculate the label for the feature vector.
- Parameters
filename (str) – The path to the file to process
- Returns
- Feature counts, its name set according to get_name()
- Return type
pd.Series
- exception delta.corpus.CorpusNotComplete(msg='Corpus not complete anymore')[source]¶
Bases:
ValueError
- class delta.corpus.Corpus(source=None, *, subdir=None, file=None, corpus=None, feature_generator=None, document_describer=<delta.util.DefaultDocumentDescriber object>, metadata=None, **kwargs)[source]¶
Bases:
pandas.core.frame.DataFrame
Creates a new Corpus.
You can create a corpus either from a filesystem subdir with raw text files, or from a CSV file with a document-term matrix, or from another corpus or dataframe that contains (potentially preprocessed) document/term vectors. Either option may be passed via appropriately named keyword argument or as the only positional argument, but exactly one must be present.
If you pass a subdirectory, Corpus will call a FeatureGenerator to read and parse the files and to generate a default word count. The default implementation will search for plain text files *.txt inside the directory and parse them using a simple regular expression. It has a few options, e.g., glob and lower_case, that can also be passed directly to Corpus as keyword arguments. E.g., Corpus('texts', glob='plain-*.txt', lower_case=True) will look for files called plain-xxx.txt and convert them to lower case before tokenizing. See FeatureGenerator for more details.
The document_describer can contain per-document metadata which can be used, e.g., as ground truth.
The metadata record contains global metadata (e.g., which transformations have already been performed); it will be inherited from a corpus argument, and all additional keyword arguments will be included in this record.
- Parameters
source – Positional variant of either subdir, file, or corpus
- Keyword Arguments
subdir (str) – Path to a subdirectory containing the (unprocessed) corpus data.
file (str) – Path to a CSV file containing the feature vectors.
corpus (pandas.DataFrame) – A dataframe or Corpus from which to create a new corpus, as a copy.
feature_generator (FeatureGenerator) – A customizable helper class that will process a subdir to a feature matrix, if the subdir argument is also given. If None, a default feature generator will be used.
metadata (dict) – A dictionary with metadata to copy into the new corpus.
**kwargs – Additionally, if feature_generator is None and subdir is not None, you can pass FeatureGenerator arguments and they will be used when instantiating the feature generator. Additional keyword arguments will be set in the metadata record of the new corpus.
Warning
You should either use a single positional argument (source) or one of subdir, file, or corpus as keyword arguments. In future versions, source will be positional-only.
- new_data(data, **metadata)[source]¶
Wraps the given DataFrame with metadata from this corpus object.
- Parameters
data (pandas.DataFrame) – Raw data that is derived by, e.g., pandas filter operations
**metadata – Metadata fields that should be changed / modified
- save(filename='corpus_words.csv')[source]¶
Saves the corpus to a CSV file.
The corpus will be saved to a CSV file containing documents in the columns and features in the rows, i.e. a transposed representation. Document and feature labels will be saved to the first row or column, respectively.
A metadata file will be saved alongside the file.
- Parameters
filename (str) – The target file.
- is_absolute() → bool[source]¶
- Returns
True if this is a corpus using absolute frequencies
- Return type
bool
- is_complete() → bool[source]¶
A corpus is complete as long as it contains the absolute frequencies of all features of all documents. Many operations like calculating the relative frequencies require a complete corpus. Once a corpus has lost its completeness, it is not possible to restore it.
- get_mfw_table(mfwords)[source]¶
Shortens the list to the given number of most frequent words and converts the word counts to relative frequencies.
This returns a new Corpus; the data in this object is not modified.
- Parameters
mfwords (int) – number of most frequent words in the new corpus. 0 means all words.
- Returns
a new sorted corpus shortened to mfwords
- Return type
Corpus
- filter_wordlist(filename, **kwargs)[source]¶
Returns a new corpus that contains the features from the given file.
This method will read the list of words from the given file and then return a new corpus that uses the features listed in the file, in the order they are in the file.
- Parameters
filename (str) – Path to the file to load. Each line contains one feature. Leading and trailing whitespace, lines starting with #, and empty lines are ignored.
- Returns
New corpus with selected features.
- filter_features(features, **metadata)[source]¶
Returns a new corpus that contains only the given features.
- Parameters
features (Iterable) – The features to select. If they are in a file, use filter_wordlist.
- cull(ratio=None, threshold=None, keepna=False)[source]¶
Removes all features that do not appear in a minimum number of documents.
- Parameters
ratio (float) – Minimum ratio of documents a word must occur in to be retained. Note that we’re always rounding towards the ceiling, i.e. if the corpus contains 10 documents and ratio=1/3, a word must occur in at least 4 documents (if this is >= 1, it is interpreted as threshold)
threshold (int) – Minimum number of documents a word must occur in to be retained
keepna (bool) – If set to True, the missing words in the returned corpus will be retained as nan instead of 0.
- Returns
- A new corpus with the culled words removed. The original corpus is left unchanged.
- Return type
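The documented rounding rule for ratio can be sketched as follows (min_documents is an illustrative helper, not part of the API):

```python
import math

# The ratio is rounded towards the ceiling: with 10 documents and ratio=1/3,
# a word must occur in at least ceil(10/3) = 4 documents to be retained.
def min_documents(n_docs, ratio):
    return math.ceil(n_docs * ratio)

print(min_documents(10, 1 / 3))  # 4
```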
- reparse(feature_generator, subdir=None, **kwargs)[source]¶
Parse or re-parse a set of documents with different settings.
This runs the given feature generator on the given or configured subdirectory. The feature vectors returned by the feature generator will replace or augment the corpus.
- Parameters
feature_generator (FeatureGenerator) – Will be used for extracting the features.
subdir (str) – If given, will be passed to the feature generator for processing. Otherwise, we’ll use the subdir configured with this corpus.
**kwargs – Additional metadata for the returned corpus.
- Returns
- a new corpus with the respective columns replaced or added. The current object will be left unchanged.
- Return type
- Raises
CorpusNotAbsolute – if called on a corpus with relative frequencies
- tokens() → pandas.core.series.Series[source]¶
Number of tokens by text
- types() → pandas.core.series.Series[source]¶
Number of different features by text
- ttr_by_text() → pandas.core.series.Series[source]¶
Type/token ratio for each text.
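For a single text, the type/token ratio reduces to the number of distinct features (types) divided by the total number of tokens. An illustrative sketch, not the actual implementation:

```python
from collections import Counter

tokens = ["to", "be", "or", "not", "to", "be"]
counts = Counter(tokens)
# 4 types / 6 tokens
ttr = len(counts) / sum(counts.values())
print(ttr)
```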
delta.deltas module¶
This module contains the actual delta measures.
Normalizations¶
A normalization is a function that works on a Corpus and returns a somewhat normalized version of that corpus. Each normalization has the following additional attributes:
name – an identifier for the normalization, usually the function name
title – an optional, human-readable name for the normalization
Each normalization leaves its name in the ‘normalizations’ field of the corpus’
Metadata
.
All available normalizations need to be registered to the normalization registry.
Delta Functions¶
A delta function takes a Corpus and creates a DistanceMatrix from that. Each delta function has the following properties:
- descriptor – a systematic descriptor of the distance function. For simple delta functions (see below), this is simply the name. For composite distance functions, this starts with the name of a simple delta function and is followed by a list of normalizations (in order) that are applied to the corpus before applying the distance function.
name – a unique name for the distance function
title – an optional, human-readable name for the distance function.
Simple Delta Functions¶
Simple delta functions are functions that compute the distance between two feature vectors directly; they can be created by instantiating DeltaFunction with a distance function or by using the delta() decorator.
- class delta.deltas.Normalization(f, name=None, title=None, register=True)[source]¶
Bases:
object
Wrapper for normalizations.
- delta.deltas.normalization(*args, **kwargs)[source]¶
Decorator that creates a Normalization from a function or (callable) object. Can be used with or without keyword arguments:
name (str): Name (identifier) for the normalization. By default, the function's name is used.
title (str): Human-readable title for the normalization.
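The dual calling convention (bare decorator vs. decorator with keyword arguments) can be sketched with a simplified stand-in; this omits pydelta's registry and uses illustrative attribute handling only:

```python
def normalization(*args, **kwargs):
    def wrap(f):
        # attach name/title; by default the function's own name is used
        f.name = kwargs.get("name", f.__name__)
        f.title = kwargs.get("title", f.name)
        return f
    if args and callable(args[0]):   # bare use: @normalization
        return wrap(args[0])
    return wrap                      # parameterized use: @normalization(name=...)

@normalization
def z_score(corpus):
    return (corpus - corpus.mean()) / corpus.std()

@normalization(name="div_by_sum", title="Relative frequencies")
def relative(corpus):
    return corpus / corpus.sum()

print(z_score.name, relative.title)  # z_score Relative frequencies
```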
- class delta.deltas.DeltaFunction(f=None, descriptor=None, name=None, title=None, register=True)[source]¶
Bases:
object
Abstract base class of a delta function.
To define a delta function, you have various options:
subclass DeltaFunction and override its __call__() method with something that directly handles a Corpus
subclass DeltaFunction and override its distance() method with a distance function
instantiate DeltaFunction and pass it a distance function, or use the delta() decorator
use one of the subclasses
Creates a custom delta function.
- Parameters
f (function) – a distance function that calculates the difference between two feature vectors and returns a float. If passed, this will be used for the implementation.
name (str) – The name/id of this function. Can be inferred from f or descriptor.
descriptor (str) – The descriptor to identify this function.
title (str) – A human-readable title for this function.
register (bool) – If true (default), register this delta function with the function registry on instantiation.
- static distance(u, v, *args, **kwargs)[source]¶
Calculate a distance between two feature vectors.
This is an abstract method, you must either inherit from DeltaFunction and override distance or assign a function in order to use this.
- Parameters
u (pandas.Series) – The first document to compare.
v (pandas.Series) – The second document to compare.
*args – Passed through from the caller
**kwargs – Passed through from the caller
- Returns
Distance between the documents.
- Return type
- Raises
NotImplementedError – if no implementation is provided.
- iterate_distance(corpus, *args, **kwargs)[source]¶
Calculates the distance matrix for the given corpus.
The default implementation will iterate over all pairwise combinations of the documents in the given corpus and call distance() on each pair, passing on the additional arguments.
Clients may want to use __call__() instead, i.e. call this object as a function.
- Parameters
corpus (Corpus) – feature matrix for which to calculate the distance
*args – further arguments for the matrix
**kwargs – further arguments for the matrix
- Returns
- square dataframe containing pairwise distances. The default implementation will return a matrix that has zeros on the diagonal and the lower triangle a mirror of the upper triangle.
- Return type
- create_result(df, corpus)[source]¶
Wraps a square dataframe to a DistanceMatrix, adding appropriate metadata from corpus and this delta function.
- Parameters
df (pandas.DataFrame) – Distance matrix like created by iterate_distance()
corpus (Corpus) – source feature matrix
- Returns
df as values, appropriate metadata
- Return type
- prepare(corpus)[source]¶
Return the corpus prepared for the metric, if applicable.
Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.
If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.
The default implementation simply returns the corpus as-is.
- Raises
NotImplementedError – if there is no metric
- class delta.deltas.PreprocessingDeltaFunction(distance_function, prep_function, descriptor=None, name=None, title=None, register=True)[source]¶
Bases:
delta.deltas.DeltaFunction
Creates a custom delta function.
- Parameters
f (function) – a distance function that calculates the difference between two feature vectors and returns a float. If passed, this will be used for the implementation.
name (str) – The name/id of this function. Can be inferred from f or descriptor.
descriptor (str) – The descriptor to identify this function.
title (str) – A human-readable title for this function.
register (bool) – If true (default), register this delta function with the function registry on instantiation.
- class delta.deltas.CompositeDeltaFunction(descriptor, name=None, title=None, register=True)[source]¶
Bases:
delta.deltas.DeltaFunction
A composite delta function consists of a basis (which is another delta function) and a list of normalizations. It first transforms the corpus via all the given normalizations in order, and then runs the basis on the result.
Creates a new composite delta function.
- Parameters
descriptor (str) – Formally defines this delta function: first the name of an existing, registered distance function, then, separated by -, the names of normalizations to run, in order.
name (str) – Name by which this delta function is registered, in addition to the descriptor
title (str) – human-readable title
register (bool) – If true (the default), register this delta function on creation
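Decomposing such a descriptor can be sketched as follows. The split logic is illustrative; "manhattan" and "z_score" are assumed names (check the registry for the actually registered functions and normalizations):

```python
def parse_descriptor(descriptor):
    # basis delta function first, then the normalizations to run, in order
    basis, *normalizations = descriptor.split("-")
    return basis, normalizations

# e.g. a Burrows-style delta: manhattan distance after z-score scaling
print(parse_descriptor("manhattan-z_score"))  # ('manhattan', ['z_score'])
```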
- prepare(corpus)[source]¶
Return the corpus prepared for the metric, if applicable.
Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.
If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.
The default implementation simply returns the corpus as-is.
- Raises
NotImplementedError – if there is no metric
- class delta.deltas.PDistDeltaFunction(metric, name=None, title=None, register=True, scale=False, **kwargs)[source]¶
Bases:
delta.deltas.DeltaFunction
Wraps one of the metrics implemented by ssd.pdist() as a delta function.
Warning
You should use MetricDeltaFunction instead.
- class delta.deltas.MetricDeltaFunction(metric, name=None, title=None, register=True, scale=False, fix_symmetry=True, **kwargs)[source]¶
Bases:
delta.deltas.DeltaFunction
Distance functions based on scikit-learn's sklearn.metrics.pairwise_distances().
- Parameters
metric (str) – The metric that should be called via sklearn.metrics.pairwise_distances
name (str) – Name / Descriptor for the delta function, if None, metric is used
title (str) – Human-Readable Title
register (bool) – If false, don’t register this with the registry
scale (bool) – Scale by number of features
fix_symmetry – Force the resulting matrix to be symmetric
**kwargs – passed on to ssd.pdist()
Note
sklearn.metrics.pairwise_distances() is fast, but the result may not be exactly symmetric. The fix_symmetry option enforces symmetry by mirroring the lower-left triangle after calculating distances so that, e.g., scipy clustering won't complain.
- class delta.deltas.DistanceMatrix(df, copy_from=None, metadata=None, corpus=None, document_describer=None, **kwargs)[source]¶
Bases:
pandas.core.frame.DataFrame
- delta_values(transpose=False, check=True)[source]¶
Converts the given n×n Delta matrix to a \(\binom{n}{2}\) long series of distinct delta values – i.e. duplicates from the upper triangle and zeros from the diagonal are removed.
- Parameters
transpose – if True, transpose the dataframe first, i.e. use the upper right triangle
check – if True and if the result does not contain any non-null value, try the other option for transpose.
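A sketch of the triangle extraction with pandas (illustrative only, not the actual implementation): mask everything except one triangle, then stack and drop the resulting NaNs.

```python
import numpy as np
import pandas as pd

# a small symmetric 3x3 distance matrix with zeros on the diagonal
dm = pd.DataFrame([[0.0, 1.0, 2.0],
                   [1.0, 0.0, 3.0],
                   [2.0, 3.0, 0.0]],
                  index=list("abc"), columns=list("abc"))
# keep only the strictly lower triangle (k=-1 excludes the diagonal)
mask = np.tril(np.ones(dm.shape, dtype=bool), k=-1)
values = dm.where(mask).stack().dropna()  # binom(3, 2) = 3 distinct values
print(len(values))  # 3
```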
- delta_values_df()[source]¶
Returns a stacked form of the given delta table along with additional metadata. Assumes delta is symmetric.
The dataframe returned has the columns Author1, Author2, Text1, Text2, and Delta; it has an entry for every unique combination of texts.
- f_ratio()[source]¶
Calculates the (normalized) F-ratio over the distance matrix, according to Heeringa et al.
Checks whether the distances within a group (i.e., texts with the same author) are much smaller than the distances between groups.
- fisher_ld()[source]¶
Calculates Fisher’s Linear Discriminant for the distance matrix.
cf. Heeringa et al.
- partition()[source]¶
Splits this distance matrix into two sparse halves: the first contains only the differences between documents that are in the same group (‘in-group’), the second only the differences between documents that are in different groups.
Group associations are created according to the DocumentDescriber.
- Returns
(in_group, out_group)
- Return type
- simple_score()[source]¶
Simple delta quality score for the given delta matrix:
The difference between the means of the standardized differences between works of different authors and works of the same author; i.e. different authors are considered score standard deviations more different than equal authors.
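A minimal sketch of this score on plain lists (hypothetical helper names; the real method derives the same-author grouping from the DocumentDescriber):

```python
import statistics

def simple_score(deltas, same_author):
    # standardize all pairwise delta values
    mu = statistics.mean(deltas)
    sigma = statistics.stdev(deltas)
    z = [(d - mu) / sigma for d in deltas]
    inside = [v for v, same in zip(z, same_author) if same]
    outside = [v for v, same in zip(z, same_author) if not same]
    # mean of cross-author pairs minus mean of same-author pairs
    return statistics.mean(outside) - statistics.mean(inside)

# same-author pairs are closer (1.0) than cross-author pairs (3.0)
score = simple_score([1.0, 1.0, 3.0, 3.0], [True, True, False, False])
print(round(score, 3))  # 1.732
```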
- compare_with(doc_metadata, comparisons=None, join='inner')[source]¶
Compare the distance matrix values with values calculated from the given document metadata table.
- Parameters
doc_metadata (pd.DataFrame) – a dataframe with one row per document and arbitrary columns
comparisons – see compare_pairwise
join (str) – inner (the default) or outer; if outer, keep pairs for which we have neither metadata nor comparisons.
- Returns
a dataframe with a row for each pairwise document combination (as in DistanceMatrix.delta_values). The first column will contain the delta values, subsequent columns the metadata comparisons.
delta.experiments module¶
The experiments module can be used to perform a series of experiments in which you vary some of the arguments. Here’s the basic data model:
A _facet_ is an aspect you wish to vary, e.g. the number of features. A facet delivers a set of _expressions_. Each expression represents an actual value of the facet; e.g., "3000 most frequent words" might be an expression of the facet 'number of features'.
There are some different kinds of facets:
A _corpus builder facet_ determines how the actual corpus is built. The corpus builder facets are used to actually assemble a constructor call to the delta.Corpus class, i.e. for every combination of expressions we get a new Corpus. Thus, variation here may be quite lengthy.
A _corpus manipulation facet_ takes an existing corpus and manipulates it, e.g., by extracting the n most frequent words. This is much faster than building the corpus anew each time, so if you can, implement a corpus manipulation facet instead of a corpus builder one.
A _method facet_ delivers a delta function that should be manipulated.
delta.features module¶
Feature selection utilities.
- delta.features.get_rfe_features(corpus, estimator=None, steps=[(10000, 1000), (1000, 200), (500, 25)], cv=True)[source]¶
- Parameters
corpus – containing document_describer,
estimator – supervised learning estimator,
steps – list of tuples (features_to_select, step)
cv – additional cross-validated selection.
- Returns
set of selected terms.
- Return type
rfe_terms
delta.graphics module¶
Various visualization tools.
- class delta.graphics.Dendrogram(clustering, describer=None, ax=None, orientation='left', font_size=None, link_color='k', title='Corpus: {corpus}', xlabel='Delta: {delta_title}, {words} most frequent {features}')[source]¶
Bases:
object
Creates a dendrogram representation from a hierarchical clustering.
This is a wrapper around, and an improvement to, sch.dendrogram(), tailored for use in pydelta.
- Parameters
clustering (Clustering) – A hierarchical clustering.
describer (DocumentDescriber) – Document describer used for determining the groups and the labels for the documents used (optional). By default, the document describer inherited from the clustering is used.
ax (mpl.axes.Axes) – Axes object to draw on. Uses pyplot default axes if not provided.
orientation (str) – Orientation of the dendrogram. Currently, only "left" is supported (default).
font_size – Font size for the labels, in points. If not provided, sch.dendrogram() calculates a default.
link_color (str) – The color used for the links in the dendrogram, by default k (for black).
title (str) – a title that will be printed on the plot. The string may be a template string as supported by str.format_map() with metadata field names in curly braces; it will be evaluated against the clustering's metadata. If you pass None here, no title will be added.
Notes
The dendrogram will be painted by matplotlib / pyplot using the default styles, which means you can use, e.g., seaborn to influence the overall design of the image.
Dendrogram handles coloring differently than sch.dendrogram(): it will color the document labels according to the pre-assigned grouping (e.g., by author). To do so, it builds on matplotlib's default color_cycle and rotates through it, so if you need more colors, adjust the color_cycle accordingly.
- delta.graphics.scatterplot_delta(deltas, red_f=MDS(dissimilarity='precomputed', n_jobs=- 1))[source]¶
deltas: pydelta distance matrix
red_f: function for dimensionality reduction, e.g. decomposition.PCA(n_components=2)
return: the plot
- delta.graphics.spikeplot(corpus, docs=slice(None, None, None), features=50, figsize=None, **kwargs)[source]¶
Prepares a spike plot of a (normalized) corpus.
- Parameters
corpus (pandas.DataFrame) – The corpus to plot
docs (int, list or slice) – the documents to include in the plot, default: all documents
features (int, list, or slice) – the features to plot, default: top 50 features
figsize (2-element list) – size of the plot
kwargs – will be passed on to pd.DataFrame.plot()
Notes
The arguments docs and features can be either: None, selecting all items; something you would put into corpus.index[·] or corpus.columns[·], respectively, i.e. a label indexer; an integer, selecting the first n items; or a list of integers, selecting exactly those items.
- Returns
the plot
delta.util module¶
Contains utility classes and functions.
- class delta.util.Metadata(*args, **kwargs)[source]¶
Bases:
collections.abc.Mapping
A metadata record contains information about how a particular object of the pyDelta universe has been constructed, or how it will be manipulated.
Metadata fields are simply attributes, and they can be used as such.
Create a new metadata instance. Arguments will be passed on to update().
Examples
>>> m = Metadata(lower_case=True, sorted=False)
>>> Metadata(m, sorted=True, words=5000)
Metadata(lower_case=True, sorted=True, words=5000)
- static metafilename(filename)[source]¶
Returns an appropriate metadata filename for the given filename.
>>> Metadata.metafilename("foo.csv")
'foo.csv.meta'
>>> Metadata.metafilename("foo.csv.meta")
'foo.csv.meta'
- classmethod load(filename)[source]¶
Loads a metadata instance from the filename identified by the argument.
- Parameters
filename (str) – The name of the metadata file, or of the file to which a sidecar metadata filename exists
- save(filename, **kwargs)[source]¶
Saves the metadata instance to a JSON file.
- Parameters
filename (str) – Name of the metadata file or the source file
**kwargs – are passed on to
json.dump()
- to_json(**kwargs)[source]¶
Returns a JSON string containing this metadata object’s contents.
- Parameters
**kwargs – Arguments passed to
json.dumps()
- class delta.util.DocumentDescriber[source]¶
Bases:
object
DocumentDescribers are able to extract metadata from the document IDs of a corpus.
The idea is that a
Corpus
contains some sort of document name (e.g., original filenames); however, some components need information inferred from metadata. A DocumentDescriber produces this information from the document name, be it by inferring it directly (e.g., using some filename policy) or by consulting an external database. This base implementation expects filenames of the format “Author_Title.ext” and returns author names as groups and titles as in-group labels.
The
DefaultDocumentDescriber
adds author and title shortening, and the metadata-based
TableDocumentDescriber
uses an external metadata table.
- group_name(document_name)[source]¶
Returns the unique name of the group the document belongs to.
The default implementation returns the part of the document name before the first
_
.
- item_name(document_name)[source]¶
Returns the name of the item within the group.
The default implementation returns the part of the document name after the first
_
.
- group_label(document_name)[source]¶
Returns a (maybe shortened) label for the group, for display purposes.
The default implementation just returns the
group_name()
.
- item_label(document_name)[source]¶
Returns a (maybe shortened) label for the item within the group, for display purposes.
The default implementation just returns the
item_name()
.
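The default name-splitting rule described above can be sketched in plain Python. This is an illustrative stand-in, not the library's actual code; in particular, stripping the file extension before splitting is an assumption.

```python
import os


def group_name(document_name: str) -> str:
    """Part of the document name before the first underscore (the author).

    Extension stripping is an assumption for this sketch."""
    base = os.path.splitext(document_name)[0]
    return base.partition("_")[0]


def item_name(document_name: str) -> str:
    """Part of the document name after the first underscore (the title)."""
    base = os.path.splitext(document_name)[0]
    return base.partition("_")[2]


print(group_name("Austen_Emma.txt"))  # Austen
print(item_name("Austen_Emma.txt"))   # Emma
```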
- class delta.util.DefaultDocumentDescriber[source]¶
Bases:
delta.util.DocumentDescriber
- class delta.util.TableDocumentDescriber(table, group_col, name_col, dialect='excel', **kwargs)[source]¶
Bases:
delta.util.DocumentDescriber
A document describer that takes groups and item labels from an external table.
- Parameters
table (str or pandas.DataFrame) – A table with metadata that describes the documents of the corpus, either a
pandas.DataFrame
or a path or IO to a CSV file. The table's index (or first column for CSV files) contains the document ids that are returned by the
FeatureGenerator
. The columns (or first row) contain column labels.
group_col (str) – Name of the column in the table that contains the names of the groups. Will be used, e.g., for determining the ground truth for cluster evaluation, and for coloring the dendrograms.
name_col (str) – Name of the column in the table that contains the names of the individual items.
dialect (str or
csv.Dialect
) – CSV dialect to use for reading the file.
**kwargs – Passed on to
pandas.read_table()
.
- Raises
ValueError – when arguments inconsistent
- See:
pandas.read_table
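The table lookup this class performs can be approximated with only the standard library's csv module. This is a minimal stand-in for illustration; the class name and behavior here are assumptions, not the library's implementation (which builds on pandas).

```python
import csv
import io


class SimpleTableDescriber:
    """Illustrative stand-in: look up group and item labels in a CSV table.

    The first column holds document ids; group_col / name_col select the
    metadata columns by header name."""

    def __init__(self, table_io, group_col, name_col, dialect="excel"):
        reader = csv.reader(table_io, dialect=dialect)
        header = next(reader)
        gi, ni = header.index(group_col), header.index(name_col)
        # map document id -> (group label, item label)
        self.table = {row[0]: (row[gi], row[ni]) for row in reader}

    def group_name(self, document_id):
        return self.table[document_id][0]

    def item_name(self, document_id):
        return self.table[document_id][1]


csv_text = "id,Author,Title\nAusten_Emma,Austen,Emma\n"
d = SimpleTableDescriber(io.StringIO(csv_text), "Author", "Title")
print(d.group_name("Austen_Emma"))  # Austen
```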
- delta.util.ngrams(iterable, n=2, sep=None)[source]¶
Transforms an iterable into an iterable of ngrams.
- Parameters
- Yields
if sep is None, this yields n-tuples of the iterable. If sep is a string, it is used to join the tuples
Example
>>> list(ngrams('This is a test'.split(), n=2, sep=' ')) ['This is', 'is a', 'a test']
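The behavior shown in the example can be sketched as a simplified reimplementation (not the library's code):

```python
def ngrams(iterable, n=2, sep=None):
    """Yield n-tuples of consecutive items; if sep is a string, join each
    tuple with it instead."""
    items = list(iterable)
    for i in range(len(items) - n + 1):
        gram = tuple(items[i:i + n])
        yield sep.join(gram) if sep is not None else gram


print(list(ngrams('This is a test'.split(), n=2, sep=' ')))
# ['This is', 'is a', 'a test']
```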
- delta.util.compare_pairwise(df, comparisons=None)[source]¶
Builds a table with pairwise comparisons of specific columns in the dataframe df.
This function is intended to provide additional relative metadata to the pairwise distances of a (symmetric) DistanceMatrix. It will take a dataframe and compare its rows pairwise according to the second argument, returning a dataframe in the ‘vector’ form of
ssd.squareform()
. If your comparisons can be expressed as np.ufuncs, this will be quite efficient.
- Parameters
df – A dataframe. rows = instances, columns = features.
comparisons –
A list of comparison specs. Each spec should be either:
a column name (e.g., a string) for default settings: The absolute difference (np.subtract) for numerical columns, np.equal for everything else
a tuple with 2-4 entries: (source_column, ufunc [, postfunc: callable] [, target_column: str])
source column is the name of the column in df to compare
ufunc is a two-argument
np.ufunc
which is pairwise applied to all combinations of the column
postfunc is a one-argument function that is applied to the final, 1D result vector
target_column is the name of the column in the result dataframe (if missing, source column will be used)
If comparisons is missing, a default comparison will be created for every column
- Returns
A dataframe. Will have a column for each
comparison
spec and a row for each unique pair in the index. The order of rows will be similar to [(i, j) for i in 0..(n-1) for j in (i+1)..(n-1)].
Example
>>> df = pd.DataFrame({'Class': ['a', 'a', 'b'], 'Size': [42, 30, 5]})
>>> compare_pairwise(df)
     Class  Size
0 1   True    12
  2  False    37
1 2  False    25
>>> compare_pairwise(df, ['Class', ('Size', np.subtract, np.absolute, 'Size_Diff'), ('Size', np.add, 'Size_Total')])
     Class  Size_Diff  Size_Total
0 1   True         12          72
  2  False         37          47
1 2  False         25          35
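The pairwise-comparison idea can be illustrated with a small pure-Python sketch, using plain dicts instead of dataframes. Names and the comparison-spec shape here are simplified assumptions, not the library's API.

```python
from itertools import combinations


def compare_pairwise_simple(rows, comparisons):
    """Compare every unordered pair of rows.

    rows: dict mapping instance id -> dict of column values.
    comparisons: dict mapping result column -> (source column, 2-arg function)."""
    result = []
    for (i, a), (j, b) in combinations(rows.items(), 2):
        rec = {"pair": (i, j)}
        for col, (src, func) in comparisons.items():
            rec[col] = func(a[src], b[src])
        result.append(rec)
    return result


rows = {0: {"Class": "a", "Size": 42}, 1: {"Class": "a", "Size": 30},
        2: {"Class": "b", "Size": 5}}
out = compare_pairwise_simple(rows, {
    "Class": ("Class", lambda x, y: x == y),       # np.equal analogue
    "Size": ("Size", lambda x, y: abs(x - y)),     # |np.subtract| analogue
})
print(out[0])  # {'pair': (0, 1), 'Class': True, 'Size': 12}
```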
Module contents¶
pydelta library¶
Stylometrics in Python
- class delta.Corpus(source=None, *, subdir=None, file=None, corpus=None, feature_generator=None, document_describer=<delta.util.DefaultDocumentDescriber object>, metadata=None, **kwargs)[source]¶
Bases:
pandas.core.frame.DataFrame
Creates a new Corpus.
You can create a corpus either from a filesystem subdir with raw text files, or from a CSV file with a document-term matrix, or from another corpus or dataframe that contains (potentially preprocessed) document/term vectors. Either option may be passed via appropriately named keyword argument or as the only positional argument, but exactly one must be present.
If you pass a subdirectory, Corpus will call a
FeatureGenerator
to read and parse the files and to generate a default word count. The default implementation will search for plain text files*.txt
inside the directory and parse them using a simple regular expression. It has a few options, e.g.,glob
andlower_case
, that can also be passed directly to corpus as keyword arguments. E.g.,Corpus('texts', glob='plain-*.txt', lower_case=True)
will look for files called plain-xxx.txt and convert them to lower case before tokenizing. See
FeatureGenerator
for more details. The
document_describer
can contain per-document metadata which can be used, e.g., as ground truth. The
metadata
record contains global metadata (e.g., which transformations have already been performed); it will be inherited from a
corpus
argument; all additional keyword arguments will be included with this record.
- Parameters
source – Positional variant of either subdir, file, or corpus
- Keyword Arguments
subdir (str) – Path to a subdirectory containing the (unprocessed) corpus data.
file (str) – Path to a CSV file containing the feature vectors.
corpus (pandas.DataFrame) – A dataframe or
Corpus
from which to create a new corpus, as a copy.feature_generator (FeatureGenerator) – A customizeable helper class that will process a
subdir
to a feature matrix, if thesubdir
argument is also given. If None, a default feature generator will be used.metadata (dict) – A dictionary with metadata to copy into the new corpus.
**kwargs – Additionally, if feature_generator is None and subdir is not None, you can pass FeatureGenerator arguments and they will be used when instantiating the feature generator. Additional keyword arguments will be set in the metadata record of the new corpus.
Warning
You should either use a single positional argument (source) or one of subdir, file, or corpus as keyword arguments. In future versions, source will be positional-only.
- new_data(data, **metadata)[source]¶
Wraps the given
DataFrame
with metadata from this corpus object.- Parameters
data (pandas.DataFrame) – Raw data that is derived by, e.g., pandas filter operations
**metadata – Metadata fields that should be changed / modified
- save(filename='corpus_words.csv')[source]¶
Saves the corpus to a CSV file.
The corpus will be saved to a CSV file containing documents in the columns and features in the rows, i.e. a transposed representation. Document and feature labels will be saved to the first row or column, respectively.
A metadata file will be saved alongside the file.
- Parameters
filename (str) – The target file.
- is_absolute() → bool[source]¶
- Returns
True
if this is a corpus using absolute frequencies- Return type
- is_complete() → bool[source]¶
A corpus is complete as long as it contains the absolute frequencies of all features of all documents. Many operations like calculating the relative frequencies require a complete corpus. Once a corpus has lost its completeness, it is not possible to restore it.
- get_mfw_table(mfwords)[source]¶
Shortens the list to the given number of most frequent words and converts the word counts to relative frequencies.
This returns a new
Corpus
, the data in this object is not modified.- Parameters
mfwords (int) – number of most frequent words in the new corpus. 0 means all words.
See also
- Returns
a new sorted corpus shortened to
mfwords
- Return type
- filter_wordlist(filename, **kwargs)[source]¶
Returns a new corpus that contains the features from the given file.
This method will read the list of words from the given file and then return a new corpus that uses the features listed in the file, in the order they are in the file.
- Parameters
filename (str) – Path to the file to load. Each line contains one feature. Leading and trailing whitespace, lines starting with
#
, and empty lines are ignored.- Returns
New corpus with selected features.
- filter_features(features, **metadata)[source]¶
Returns a new corpus that contains only the given features.
- Parameters
features (Iterable) – The features to select. If they are in a file, use filter_wordlist
- cull(ratio=None, threshold=None, keepna=False)[source]¶
Removes all features that do not appear in a minimum number of documents.
- Parameters
ratio (float) – Minimum ratio of documents a word must occur in to be retained. Note that we’re always rounding towards the ceiling, i.e. if the corpus contains 10 documents and ratio=1/3, a word must occur in at least 4 documents (if this is >= 1, it is interpreted as threshold)
threshold (int) – Minimum number of documents a word must occur in to be retained
keepna (bool) – If set to True, the missing words in the returned corpus will be retained as
nan
instead of0
.
- Returns
- A new corpus with the culled words removed. The original
corpus is left unchanged.
- Return type
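The ratio/threshold handling described above can be sketched as follows. This is an assumption reconstructed from the docstring, not the actual implementation:

```python
import math


def cull_threshold(n_docs, ratio=None, threshold=None):
    """Resolve cull's parameters to a minimum document count.

    A ratio >= 1 is treated as an absolute threshold; otherwise the
    document count is rounded towards the ceiling."""
    if threshold is not None:
        return threshold
    if ratio is None:
        return 0
    if ratio >= 1:
        return int(ratio)
    return math.ceil(ratio * n_docs)


print(cull_threshold(10, ratio=1/3))  # 4
```

With 10 documents and ratio=1/3, ceil(10/3) gives the threshold of 4 documents mentioned in the parameter description.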
- reparse(feature_generator, subdir=None, **kwargs)[source]¶
Parse or re-parse a set of documents with different settings.
This runs the given feature generator on the given or configured subdirectory. The feature vectors returned by the feature generator will replace or augment the corpus.
- Parameters
feature_generator (FeatureGenerator) – Will be used for extracting the features.
subdir (str) – If given, will be passed to the feature generator for processing. Otherwise, we’ll use the subdir configured with this corpus.
**kwargs – Additional metadata for the returned corpus.
- Returns
- a new corpus with the respective columns replaced or added.
The current object will be left unchanged.
- Return type
- Raises
CorpusNotAbsolute – if called on a corpus with relative frequencies
- tokens() → pandas.core.series.Series[source]¶
Number of tokens per text
- types() → pandas.core.series.Series[source]¶
Number of different features by text
- ttr_by_text() → pandas.core.series.Series[source]¶
Type/token ratio for each text.
- class delta.FeatureGenerator(lower_case: bool = False, encoding: str = 'utf-8', glob: str = '*.txt', skip: Optional[str] = None, token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0), max_tokens: Optional[int] = None, ngrams: Optional[int] = None, parallel: Union[int, bool, joblib.parallel.Parallel] = False, sort: str = 'documents', sparse: bool = False)[source]¶
Bases:
object
A feature generator is responsible for converting a subdirectory of files into a feature matrix (that will then become a corpus). If you need to customize the feature extraction process, create a custom feature generator and pass it into your
Corpus
constructor call along with itssubdir
argument.The default feature generator is able to process a directory of text files, tokenize each of the text files according to a regular expression, and count each token type for each file. To customize feature extraction, you have two options:
for simple customizations, just create a new FeatureGenerator and set the constructor arguments accordingly. Look in the docstring for
__init__()
for details.in more complex cases, create a subclass and override methods as you see fit.
On a feature generator passed in to
Corpus
, only two methods will be called:
__call__()
, i.e. the object as a callable, to actually generate the feature vector, and
metadata
to obtain metadata fields that will be included in the corresponding corpus.
So, if you wish to write a completely new feature generator, you can ignore the other methods.
- Parameters
lower_case (bool) – if
True
, normalize all tokens to lower case before counting themencoding (str) – the encoding to use when reading files
glob (str) – the pattern inside the subdirectory to find files.
skip (str) – don’t handle files that match this pattern
token_pattern (re.Regex) – The regular expression used to identify tokens. The default, LETTERS_PATTERN, will simply find sequences of unicode letters. WORD_PATTERN will find the shortest sequence of letters and apostrophes between two word boundaries (according to the simple word-boundary algorithm from Unicode regular expressions) that contains at least one letter.
max_tokens (int) – If set, stop reading each file after that many words.
ngrams (int) – Count token ngrams instead of single tokens
parallel (bool, int, Parallel) – If truthy, read and parse files in parallel. The actual argument may be: None or False for no special processing; an int for the required number of jobs; or a dictionary with Parallel arguments for finer control.
sort (str) – Sort the final feature matrix by index before returning. Possible values: documents or index (sort by document names); features or columns (sort by feature labels, i.e. words); both (sort along both axes); None or the empty string (do not sort).
sparse (bool) – build a sparse dataframe. Requires Pandas >= 1.0
- token_pattern: _regex.Pattern = regex.Regex('\\p{L}+', flags=regex.V0)¶
- logger = <Logger delta.corpus.FeatureGenerator (WARNING)>¶
- tokenize(lines)[source]¶
Tokenizes the given lines.
This method is called by
count_tokens()
. The default implementation will return an iterable of all tokens in the given lines that match the token_pattern
. The result of this method can further be postprocessed by postprocess_tokens()
.- Parameters
lines – Iterable of strings in which to look for tokens.
- Returns
Iterable (default implementation generator) of tokens
- postprocess_tokens(tokens)[source]¶
Postprocesses the tokens according to the options provided when creating the feature generator.
Currently respects
lower_case
andngrams
. This is called by count_tokens after tokenizing.- Parameters
tokens – iterable of tokens as returned by
tokenize()
- Returns
iterable of postprocessed tokens
- count_tokens(lines)[source]¶
This calls
tokenize()
to split the iterablelines
into tokens. If thelower_case
attribute is given, the tokens are then converted to lower_case. The tokens are counted, the method returns apd.Series
mapping each token to its number of occurrences.This is called by
process_file()
.- Parameters
lines – Iterable of strings in which to look for tokens.
- Returns
maps tokens to the number of occurrences.
- Return type
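The tokenize-then-count pipeline can be approximated with the standard library. Note that the stdlib re pattern below only approximates the regex module's \p{L}+ default, and this function is an illustrative sketch, not the library's implementation:

```python
import re
from collections import Counter

# Runs of Unicode letters; an approximation of the regex module's \p{L}+.
LETTERS = re.compile(r"[^\W\d_]+")


def count_tokens(lines, lower_case=False):
    """Sketch of the tokenize -> postprocess -> count pipeline."""
    counts = Counter()
    for line in lines:
        for token in LETTERS.findall(line):
            counts[token.lower() if lower_case else token] += 1
    return counts


print(count_tokens(["To be or not to be"], lower_case=True))
# Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```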
- get_name(filename)[source]¶
Converts a single file name to a label for the corresponding feature vector.
- Returns
Feature vector label (filename w/o extension by default)
- Return type
- process_file(filename)[source]¶
Processes a single file to a feature vector.
The default implementation reads the file pointed to by
filename
as a text file, callscount_tokens()
to create token counts andget_name()
to calculate the label for the feature vector.- Parameters
filename (str) – The path to the file to process
- Returns
- Feature counts, with its name set according to get_name()
- Return type
pd.Series
- process_directory(directory)[source]¶
Iterates through the given directory and runs
process_file()
for each file matchingglob
in there.
- property metadata¶
Returns: Metadata: metadata record that describes the parameters of the
features used for corpora created using this feature generator.
- class delta.Normalization(f, name=None, title=None, register=True)[source]¶
Bases:
object
Wrapper for normalizations.
- delta.normalization(*args, **kwargs)[source]¶
Decorator that creates a
Normalization
from a function or (callable) object. Can be used without or with keyword arguments:
name (str): Name (identifier) for the normalization. By default, the function’s name is used.
title (str): Human-readable title for the normalization.
- class delta.DeltaFunction(f=None, descriptor=None, name=None, title=None, register=True)[source]¶
Bases:
object
Abstract base class of a delta function.
To define a delta function, you have various options:
subclass DeltaFunction and override its
__call__()
method with something that directly handles aCorpus
.subclass DeltaFunction and override its
distance()
method with a distance function
instantiate DeltaFunction and pass it a distance function, or use the
delta()
decorator
use one of the subclasses
Creates a custom delta function.
- Parameters
f (function) – a distance function that calculates the difference between two feature vectors and returns a float. If passed, this will be used for the implementation.
name (str) – The name/id of this function. Can be inferred from
f
or
descriptor
.
descriptor (str) – The descriptor to identify this function.
title (str) – A human-readable title for this function.
register (bool) – If true (default), register this delta function with the function registry on instantiation.
- static distance(u, v, *args, **kwargs)[source]¶
Calculate a distance between two feature vectors.
This is an abstract method, you must either inherit from DeltaFunction and override distance or assign a function in order to use this.
- Parameters
u (pandas.Series) – The first document to compare.
v (pandas.Series) – The second document to compare.
*args – Passed through from the caller
**kwargs –
Passed through from the caller
- Returns
Distance between the documents.
- Return type
- Raises
NotImplementedError – if no implementation is provided.
- iterate_distance(corpus, *args, **kwargs)[source]¶
Calculates the distance matrix for the given corpus.
The default implementation will iterate over all pairwise combinations of the documents in the given corpus and call
distance()
on each pair, passing on the additional arguments.Clients may want to use
__call__()
instead, i.e. they want to call this object as a function.- Parameters
corpus (Corpus) – feature matrix for which to calculate the distance
*args – further arguments for the matrix
**kwargs –
further arguments for the matrix
- Returns
- square dataframe containing pairwise distances.
The default implementation will return a matrix that has zeros on the diagonal and whose lower triangle mirrors the upper triangle.
- Return type
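The default iteration scheme can be sketched with plain dicts, standing in for the DataFrame-based implementation (an illustrative sketch under that assumption):

```python
from itertools import combinations


def iterate_distance(vectors, distance):
    """Pairwise iteration: zeros on the diagonal, lower triangle
    mirroring the upper one."""
    names = list(vectors)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = distance(vectors[a], vectors[b])
        matrix[a][b] = matrix[b][a] = d  # symmetric by construction
    return matrix


vecs = {"doc1": [1, 0], "doc2": [0, 1], "doc3": [1, 1]}
manhattan = lambda u, v: sum(abs(x - y) for x, y in zip(u, v))
m = iterate_distance(vecs, manhattan)
print(m["doc1"]["doc2"])  # 2
```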
- create_result(df, corpus)[source]¶
Wraps a square dataframe to a DistanceMatrix, adding appropriate metadata from corpus and this delta function.
- Parameters
df (pandas.DataFrame) – Distance matrix like created by
iterate_distance()
corpus (Corpus) – source feature matrix
- Returns
df as values, appropriate metadata
- Return type
- prepare(corpus)[source]¶
Return the corpus prepared for the metric, if applicable.
Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.
If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.
The default implementation simply returns the corpus as-is.
- Raises
NotImplementedError – if there is no metric.
- class delta.PDistDeltaFunction(metric, name=None, title=None, register=True, scale=False, **kwargs)[source]¶
Bases:
delta.deltas.DeltaFunction
Wraps one of the metrics implemented by
ssd.pdist()
as a delta function.
Warning
You should use MetricDeltaFunction instead.
- class delta.MetricDeltaFunction(metric, name=None, title=None, register=True, scale=False, fix_symmetry=True, **kwargs)[source]¶
Bases:
delta.deltas.DeltaFunction
Distance functions based on scikit-learn’s
sklearn.metrics.pairwise_distances()
.- Parameters
metric (str) – The metric that should be called via sklearn.metrics.pairwise_distances
name (str) – Name / Descriptor for the delta function, if None, metric is used
title (str) – Human-Readable Title
register (bool) – If false, don’t register this with the registry
scale (bool) – Scale by number of features
fix_symmetry – Force the resulting matrix to be symmetric
**kwargs – passed on to
ssd.pdist()
Note
sklearn.metrics.pairwise_distances()
is fast, but the result may not be exactly symmetric. The
fix_symmetry
option enforces symmetry by mirroring the lower-left triangle after calculating distances, so that, e.g., scipy clustering won't complain.
- class delta.CompositeDeltaFunction(descriptor, name=None, title=None, register=True)[source]¶
Bases:
delta.deltas.DeltaFunction
A composite delta function consists of a basis (which is another delta function) and a list of normalizations. It first transforms the corpus via all the given normalizations in order, and then runs the basis on the result.
Creates a new composite delta function.
- Parameters
descriptor (str) – Formally defines this delta function. First the name of an existing, registered distance function, then, separated by
-
, the names of normalizations to run, in order.
name (str) – Name by which this delta function is registered, in addition to the descriptor
title (str) – human-readable title
register (bool) – If true (the default), register this delta function on creation
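Following the docstring, a descriptor can be decomposed like this. A sketch only; the delta and normalization names used are purely illustrative:

```python
def parse_descriptor(descriptor):
    """Split a composite descriptor into its basis delta function (first)
    and the ordered list of normalizations, separated by '-'."""
    basis, *normalizations = descriptor.split("-")
    return basis, normalizations


print(parse_descriptor("manhattan-z_score"))  # ('manhattan', ['z_score'])
```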
- prepare(corpus)[source]¶
Return the corpus prepared for the metric, if applicable.
Many delta functions consist of a preparation step that normalizes the corpus in some way and a relatively standard distance metric that is one of the built-in distance metrics of scikit-learn or scipy.
If a specific delta variant supports this, it should expose a metric attribute set to a string or a callable that implements the metric, and possibly override this method in order to perform the preparation steps.
The default implementation simply returns the corpus as-is.
- Raises
NotImplementedError – if there is no metric.
- class delta.Clustering(distance_matrix, method='ward', **kwargs)[source]¶
Bases:
object
Represents a hierarchical clustering.
Note
This is subject to refactoring once we implement more clustering methods
- class delta.FlatClustering(distances, clusters=None, metadata=None, **kwargs)[source]¶
Bases:
object
A flat clustering represents a non-hierarchical clustering.
Notes
FlatClustering uses a data frame field called
data
to store the actual clustering. This field will have the same index as the distance matrix, and three columns labeledGroup
,GroupID
, andCluster
.Group
will be the group label returned by theDocumentDescriber
we use,GroupID
a numerical ID for each group (to be used as ground truth) andCluster
the numerical ID of the actual cluster assigned by the clustering algorithm. As long as FlatClustering's
initialized
property isFalse
, the clustering has not been assigned yet.
- static ngroups(df)[source]¶
With df being a data frame that has a Group column, return the number of different authors in df.
- cluster_errors()[source]¶
Calculates the number of cluster errors by:
calculating the total number of different authors in the set
calling sch.fcluster to generate at most that many flat clusters
for each of those clusters, the cluster errors are the number of authors in this cluster - 1
sum of each cluster’s errors = result
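The steps above can be sketched as follows (an illustrative reimplementation with plain dicts, not the library's scipy-based code):

```python
from collections import defaultdict


def cluster_errors(clusters, authors):
    """For each flat cluster, count the distinct authors it contains
    minus one, then sum over all clusters."""
    by_cluster = defaultdict(set)
    for doc, cluster in clusters.items():
        by_cluster[cluster].add(authors[doc])
    return sum(len(author_set) - 1 for author_set in by_cluster.values())


authors = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
clusters = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
print(cluster_errors(clusters, authors))  # 1
```

Cluster 0 mixes authors A and B (one error); cluster 1 is pure, so the total is 1.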
- purity()[source]¶
To compute purity, each cluster is assigned to the class which is most frequent in the cluster; the accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N, the total number of documents.
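A sketch of the purity computation described above (illustrative, not the library's implementation):

```python
from collections import Counter, defaultdict


def purity(clusters, classes):
    """Each cluster votes for its most frequent class; purity is the
    number of correctly assigned documents divided by N."""
    by_cluster = defaultdict(list)
    for doc, cluster in clusters.items():
        by_cluster[cluster].append(classes[doc])
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / len(clusters)


classes = {"d1": "A", "d2": "A", "d3": "B", "d4": "B"}
clusters = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
print(purity(clusters, classes))  # 0.75
```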
- adjusted_rand_index()[source]¶
Calculates the Adjusted Rand Index for the given flat clustering http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html#sklearn.metrics.adjusted_rand_score
- delta.get_rfe_features(corpus, estimator=None, steps=[(10000, 1000), (1000, 200), (500, 25)], cv=True)[source]¶
- Parameters
corpus – corpus containing a document_describer
estimator – supervised learning estimator
steps – list of tuples (features_to_select, step)
cv – additional cross-validated selection
- Returns
set of selected terms.
- Return type
rfe_terms
- class delta.Dendrogram(clustering, describer=None, ax=None, orientation='left', font_size=None, link_color='k', title='Corpus: {corpus}', xlabel='Delta: {delta_title}, {words} most frequent {features}')[source]¶
Bases:
object
Creates a dendrogram representation from a hierarchical clustering.
This is a wrapper around, and an improvement to,
sch.dendrogram()
, tailored for the use in pydelta.- Parameters
clustering (Clustering) – A hierarchical clustering.
describer (DocumentDescriber) – Document describer used for determining the groups and the labels for the documents used (optional). By default, the document describer inherited from the clustering is used.
ax (mpl.axes.Axes) – Axes object to draw on. Uses pyplot default axes if not provided.
orientation (str) – Orientation of the dendrogram. Currently, only “left” is supported (the default).
font_size – Font size for the label, in points. If not provided,
sch.dendrogram()
calculates a default.link_color (str) – The color used for the links in the dendrogram, by default
k
(for black).title (str) – a title that will be printed on the plot. The string may be a template string as supported by
str.format_map()
with metadata field names in curly braces, it will be evaluated against the clustering’s metadata. If you passNone
here, no title will be added.
Notes
The dendrogram will be painted by matplotlib / pyplot using the default styles, which means you can use, e.g., seaborn to influence the overall design of the image.
Dendrogram
handles coloring differently thansch.dendrogram()
: It will color the document labels according to the pre-assigned grouping (e.g., by author). To do so, it will build on matplotlib’s default color_cycle, and it will rotate, so if you need more colors, adjust the color_cycle accordingly.
- delta.compare_pairwise(df, comparisons=None)[source]¶
Builds a table with pairwise comparisons of specific columns in the dataframe df.
This function is intended to provide additional relative metadata to the pairwise distances of a (symmetric) DistanceMatrix. It will take a dataframe and compare its rows pairwise according to the second argument, returning a dataframe in the ‘vector’ form of
ssd.squareform()
. If your comparisons can be expressed as np.ufuncs, this will be quite efficient.
- Parameters
df – A dataframe. rows = instances, columns = features.
comparisons –
A list of comparison specs. Each spec should be either:
a column name (e.g., a string) for default settings: The absolute difference (np.subtract) for numerical columns, np.equal for everything else
a tuple with 2-4 entries: (source_column, ufunc [, postfunc: callable] [, target_column: str])
source column is the name of the column in df to compare
ufunc is a two-argument
np.ufunc
which is pairwise applied to all combinations of the column
postfunc is a one-argument function that is applied to the final, 1D result vector
target_column is the name of the column in the result dataframe (if missing, source column will be used)
If comparisons is missing, a default comparison will be created for every column
- Returns
A dataframe. Will have a column for each
comparison
spec and a row for each unique pair in the index. The order of rows will be similar to [(i, j) for i in 0..(n-1) for j in (i+1)..(n-1)].
Example
>>> df = pd.DataFrame({'Class': ['a', 'a', 'b'], 'Size': [42, 30, 5]})
>>> compare_pairwise(df)
     Class  Size
0 1   True    12
  2  False    37
1 2  False    25
>>> compare_pairwise(df, ['Class', ('Size', np.subtract, np.absolute, 'Size_Diff'), ('Size', np.add, 'Size_Total')])
     Class  Size_Diff  Size_Total
0 1   True         12          72
  2  False         37          47
1 2  False         25          35
- class delta.Metadata(*args, **kwargs)[source]¶
Bases:
collections.abc.Mapping
A metadata record contains information about how a particular object of the pyDelta universe has been constructed, or how it will be manipulated.
Metadata fields are simply attributes, and they can be used as such.
Create a new metadata instance. Arguments will be passed on to
update()
.Examples
>>> m = Metadata(lower_case=True, sorted=False)
>>> Metadata(m, sorted=True, words=5000)
Metadata(lower_case=True, sorted=True, words=5000)
- static metafilename(filename)[source]¶
Returns an appropriate metadata filename for the given filename.
>>> Metadata.metafilename("foo.csv")
'foo.csv.meta'
>>> Metadata.metafilename("foo.csv.meta")
'foo.csv.meta'
- classmethod load(filename)[source]¶
Loads a metadata instance from the filename identified by the argument.
- Parameters
filename (str) – The name of the metadata file, or of the file to which a sidecar metadata filename exists
- save(filename, **kwargs)[source]¶
Saves the metadata instance to a JSON file.
- Parameters
filename (str) – Name of the metadata file or the source file
**kwargs – are passed on to
json.dump()
- to_json(**kwargs)[source]¶
Returns a JSON string containing this metadata object’s contents.
- Parameters
**kwargs – Arguments passed to
json.dumps()
- class delta.TableDocumentDescriber(table, group_col, name_col, dialect='excel', **kwargs)[source]¶
Bases:
delta.util.DocumentDescriber
A document describer that takes groups and item labels from an external table.
- Parameters
table (str or pandas.DataFrame) – A table with metadata that describes the documents of the corpus, either a
pandas.DataFrame
or a path or IO to a CSV file. The table's index (or first column for CSV files) contains the document ids that are returned by the
FeatureGenerator
. The columns (or first row) contain column labels.
group_col (str) – Name of the column in the table that contains the names of the groups. Will be used, e.g., for determining the ground truth for cluster evaluation, and for coloring the dendrograms.
name_col (str) – Name of the column in the table that contains the names of the individual items.
dialect (str or
csv.Dialect
) – CSV dialect to use for reading the file.
**kwargs – Passed on to
pandas.read_table()
.
- Raises
ValueError – when arguments inconsistent
- See:
pandas.read_table
- class delta.KMedoidsClustering(corpus, delta, n_clusters=None, extra_args={}, metadata=None, **kwargs)[source]¶
Bases:
delta.cluster.FlatClustering