#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Processing Text Data, Creating Matrices and Cleaning Corpora
============================================================
The functions in this module are for **preprocessing**. You can read text
files, `tokenize <https://en.wikipedia.org/wiki/Tokenization_(lexical_analysis)>`_
and segment documents, create `document-term matrices <https://en.wikipedia.org/wiki/Document-term_matrix>`_,
determine and remove features, and read existing matrices. Recurring variable names
follow these conventions:
1. Corpora:
***********

* ``corpus`` means an iterable containing at least one ``document`` or ``dkpro_document``.
* ``document`` means a single string containing all characters of a text
  file, including whitespace, punctuation, etc.
* ``dkpro_document`` means a pandas DataFrame containing tokens and additional
  information, e.g. *part-of-speech tags* or *lemmas*.
* ``tokenized_corpus`` means an iterable containing at least one ``tokenized_document``.
* ``tokenized_document`` means an iterable containing the tokens of a ``document``.
* ``clean_tokenized_corpus`` means an iterable containing at least one ``clean_tokenized_document``.
* ``clean_tokenized_document`` means an iterable containing only specific
  tokens (e.g. no *stopwords* or *hapax legomena*) of a ``tokenized_document``.
* ``document_labels`` means an iterable containing the name of each ``document``.
  It must have as many elements as ``corpus``, ``tokenized_corpus`` or
  ``clean_tokenized_corpus``, respectively.
Furthermore, if a document is chunked into smaller segments, each segment counts as one document.
2. Data models:
***************

* ``document_term_matrix`` means one of two things. It is either a pandas DataFrame
  whose rows correspond to ``document_labels`` and whose columns correspond to types
  (distinct tokens in the corpus), with token frequencies as values; or a pandas
  DataFrame with a MultiIndex and a single column of word frequencies. In the second
  form, the first level of the MultiIndex is a document ID (based on
  ``document_labels``) and the second level is a type ID.
Contents:
*********

* :func:`create_document_term_matrix()` creates a document-term matrix, for either
  large or small corpora.
* :func:`duplicate_document_label()` duplicates a ``document_label`` with consecutive
  numbers.
* :func:`filter_dkpro_document()` filters a ``dkpro_document`` by specific
  *part-of-speech tags*.
* :func:`find_hapax_legomena()` determines *hapax legomena* based on frequencies
  of a ``document_term_matrix``.
* :func:`find_stopwords()` determines *most frequent words* based on frequencies
  of a ``document_term_matrix``.
* :func:`read_from_pathlist()` reads one or multiple files based on a pathlist.
* :func:`segment()` is a wrapper for :func:`segment_fuzzy()` and segments a
  ``tokenized_document`` into segments of a certain number of tokens, respecting existing chunks.
* :func:`segment_fuzzy()` segments a ``tokenized_document``, tolerating existing
  chunks (like paragraphs).
* :func:`split_paragraphs()` splits a ``document`` by paragraphs.
* :func:`tokenize()` tokenizes a ``document`` based on a Unicode regular expression.
* :func:`remove_features()` removes features from a ``document_term_matrix``.
"""
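# The naming conventions from the module docstring can be sketched with the
# standard library alone. This is only an illustration, not the module's code:
# the sample texts, the hand-picked stopword set and the plain dictionary used
# in place of a MultiIndex DataFrame are all assumptions made for this sketch.

```python
import re
from collections import Counter

# Hypothetical two-document corpus; labels and texts are made up.
document_labels = ["doc1", "doc2"]
corpus = ["A short example text.", "Another short text."]

# Simple tokenization (r"\w+" plus lowercasing) as a stand-in for tokenize().
tokenized_corpus = [re.findall(r"\w+", document.lower()) for document in corpus]

# Assumed stand-in for feature removal: drop a hand-picked stopword set.
stopwords = {"a"}
clean_tokenized_corpus = [
    [token for token in tokenized_document if token not in stopwords]
    for tokenized_document in tokenized_corpus
]

# Long form of a document-term matrix: (document ID, type ID) -> frequency.
all_types = sorted({token for doc in clean_tokenized_corpus for token in doc})
type_ids = {type_: i for i, type_ in enumerate(all_types, 1)}
document_ids = {label: i for i, label in enumerate(document_labels, 1)}
long_matrix = {
    (document_ids[label], type_ids[token]): count
    for label, tokens in zip(document_labels, clean_tokenized_corpus)
    for token, count in Counter(tokens).items()
}
```

# ``document_labels`` has one entry per document, and every key of
# ``long_matrix`` pairs a document ID with a type ID, mirroring the two
# MultiIndex levels described above.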
format='%(levelname)s %(name)s: %(message)s')
"""Opens TXT files using file paths.
Description:
    With this function you can read plain text files. Pass a list of
    full paths or a single path as argument.
    Use the function `create_document_list()` to create a list of your text files.

Args:
    doclist (Union[list[str], str]): List of all documents in the corpus
        or single path to TXT file.

Yields:
    Document.

Todo:
    * Separate metadata (author, header)

Example:
    >>> list(read_from_txt('corpus_txt/Doyle_AScandalinBohemia.txt'))[0][:20]
    'A SCANDAL IN BOHEMIA'
"""
log.info("Accessing TXT documents ...")
elif isinstance(doclist, list):
    for file in doclist:
        with open(file, 'r', encoding='utf-8') as f:
            yield f.read()
"""Opens TEI XML files using file paths.
Description:
    With this function you can read TEI-encoded XML files. Pass a list of
    full paths or a single path as argument.
    Use the function `create_document_list()` to create a list of your XML files.

Args:
    doclist (Union[list[str], str]): List of all documents in the corpus
        or single path to TEI XML file.

Yields:
    Document.

Todo:
    * Separate metadata (author, header)?

Example:
    >>> list(read_from_tei('corpus_tei/Schnitzler_Amerika.xml'))[0][142:159]
    'Arthur Schnitzler'
"""
log.info("Accessing TEI XML documents ...")
"""Opens CSV files using file paths.
Description:
    With this function you can read CSV files generated by `DARIAH-DKPro-Wrapper`_,
    a tool for natural language processing. Pass a list of full paths or a
    single path as argument. You can also select certain columns.
    Use the function `create_document_list()` to create a list of your CSV files.

    .. _DARIAH-DKPro-Wrapper: https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper

Args:
    doclist (Union[list[str], str]): List of all documents in the corpus
        or single path to CSV file.
    columns (list[str]): List of CSV column names.
        Defaults to ``['ParagraphId', 'TokenId', 'Lemma', 'CPOS', 'NamedEntity']``.

Yields:
    Document.

Todo:
    * Separate metadata (author, header)?

Example:
    >>> list(read_from_csv('corpus_csv/Doyle_AScandalinBohemia.txt.csv'))[0][:4] # doctest: +NORMALIZE_WHITESPACE
       ParagraphId  TokenId    Lemma CPOS NamedEntity
    0            0        0        a  ART           _
    1            0        1  scandal   NP           _
    2            0        2       in   PP           _
    3            0        3  bohemia   NP           _
"""
"""Tokenizes with Unicode Regular Expressions.
Description:
    With this function you can tokenize a document with a regular expression.
    You can also pass your own regular expression. The default expression is
    '\p{Letter}+\p{Punctuation}?\p{Letter}+', which means one or more letters,
    followed by at most one punctuation character, followed by one or more
    letters. One-letter words therefore do not match. To lowercase all
    tokens, set the argument `lower` to True (it is by default). For a very
    simple tokenization, set the argument `simple` to True.
    Use the functions `read_from_txt()`, `read_from_tei()` or `read_from_csv()`
    to read your text files.

Args:
    doc_txt (str): Document as string.
    expression (str): Regular expression to find tokens.
    lower (boolean): If True, lowercases all words. Defaults to True.
    simple (boolean): Uses a simple regular expression (r'\w+'). Defaults to False.
        If set to True, argument `expression` will be ignored.

Yields:
    Tokens

Example:
    >>> list(tokenize("This is one example text."))
    ['this', 'is', 'one', 'example', 'text']
"""
else:
"""Gets lemmas by selected POS-tags from DARIAH-DKPro-Wrapper output.
Description:
    With this function you can select certain columns of a CSV file
    generated by `DARIAH-DKPro-Wrapper`_, a tool for natural language processing.
    Use the function `read_from_csv()` to read CSV files.

    .. _DARIAH-DKPro-Wrapper: https://github.com/DARIAH-DE/DARIAH-DKPro-Wrapper

Args:
    doc_csv (DataFrame): DataFrame containing DARIAH-DKPro-Wrapper output.
    pos_tags (list[str]): List of DKPro POS-tags that should be selected.
        Defaults to ``['ADJ', 'V', 'NN']``.

Yields:
    Lemma.

Example:
    >>> df = pd.DataFrame({'CPOS': ['CARD', 'ADJ', 'NN', 'NN'],
    ...                    'Lemma': ['one', 'more', 'example', 'text']})
    >>> list(filter_pos_tags(df))[0] # doctest: +NORMALIZE_WHITESPACE
    1       more
    2    example
    3       text
    Name: Lemma, dtype: object
"""
"""Splits the given document by paragraphs.
Description:
    With this function you can split a document by paragraphs. You can also
    pass your own regular expression as the separator.
    Use the functions `read_from_txt()`, `read_from_tei()` or `read_from_csv()`
    to read your text files.

Args:
    doc_txt (str): Document text.
    sep (regex.Regex): Separator indicating a paragraph.

Returns:
    List of paragraphs.

Example:
    >>> split_paragraphs("This test contains \\n paragraphs.")
    ['This test contains ', ' paragraphs.']
"""
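# Splitting by a separator pattern can be sketched as follows. This is a
# minimal illustration that uses the standard-library `re` module; the name
# `split_paragraphs_sketch` and the default separator (a single newline,
# chosen to match the doctest above) are assumptions.

```python
import re


def split_paragraphs_sketch(document, sep=re.compile(r"\n")):
    """Split a document string at every match of the separator pattern."""
    if isinstance(sep, str):
        # Accept a plain pattern string as well as a compiled pattern.
        sep = re.compile(sep)
    return sep.split(document)


paragraphs = split_paragraphs_sketch("This test contains \n paragraphs.")
blank_line_paragraphs = split_paragraphs_sketch("One.\n\nTwo.", sep=r"\n\s*\n")
```

# A stricter separator such as r"\n\s*\n" splits only at blank lines, which is
# often closer to what "paragraph" means in plain text files.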
"""Segments a document, tolerating existing chunks (like paragraphs).
Description:
    Suppose you have a document and want to split it into segments of
    about 1000 tokens, but you prefer to keep paragraphs together if this
    does not increase or decrease the segment size by more than 5%.

Args:
    document: The document to process. This is an iterable of chunks, each
        of which is an iterable of tokens.
    segment_size (int): The target length of each segment in tokens.
    tolerance (Number): How much the actual segment size may differ from
        the segment_size. If 0 < tolerance < 1, this is interpreted as a
        fraction of the segment_size, otherwise it is interpreted as an
        absolute number. If tolerance < 0, chunks are never split apart.

Yields:
    Segments. Each segment is a list of chunks, each chunk is a list of
    tokens.

Example:
    >>> list(segment_fuzzy([['This', 'test', 'is', 'very', 'clear'],
    ...                     ['and', 'contains', 'chunks']], 2)) # doctest: +NORMALIZE_WHITESPACE
    [[['This', 'test']],
     [['is', 'very']],
     [['clear'], ['and']],
     [['contains', 'chunks']]]
"""
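# The behaviour described in the docstring above can be approximated with the
# following sketch. It is not the module's implementation: the name
# `segment_fuzzy_sketch` and the default tolerance of 0.05 are assumptions,
# and the negative-tolerance case ("never split chunks") is left out.

```python
def segment_fuzzy_sketch(document, segment_size, tolerance=0.05):
    """Yield segments (lists of chunks) of roughly `segment_size` tokens."""
    if tolerance < 0:
        raise ValueError("this sketch does not cover negative tolerances")
    if 0 < tolerance < 1:
        # A fractional tolerance is relative to the target segment size.
        tolerance = round(segment_size * tolerance)
    current, current_size = [], 0
    for chunk in document:
        rest = list(chunk)
        while rest:
            needed = segment_size - current_size
            if len(rest) < needed - tolerance:
                # Chunk fits and the segment is still too short: keep collecting.
                current.append(rest)
                current_size += len(rest)
                rest = []
            elif len(rest) <= needed + tolerance:
                # Chunk ends inside the tolerated window: close the segment here.
                current.append(rest)
                yield current
                current, current_size, rest = [], 0, []
            else:
                # Chunk overshoots: cut at the target size, carry the rest over.
                current.append(rest[:needed])
                yield current
                current, current_size, rest = [], 0, rest[needed:]
    if current:
        # Leftover tokens form a final, shorter segment.
        yield current


segments = list(segment_fuzzy_sketch(
    [['This', 'test', 'is', 'very', 'clear'], ['and', 'contains', 'chunks']], 2))
```

# With a segment size of 2 the fractional tolerance rounds to 0, so every
# segment holds exactly two tokens, as in the doctest above.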
# handle leftovers
tokenizer=None, flatten_chunks=False, materialize=False):
"""Segments a document into segments of about `segment_size` tokens, respecting existing chunks.

Description:
    Suppose you have a document and want to split it into segments of
    about 1000 tokens, but you prefer to keep paragraphs together if this
    does not increase or decrease the segment size by more than 5%.
    This is a convenience wrapper around `segment_fuzzy()`.

Args:
    segment_size (int): The target size of each segment, in tokens.
    tolerance (Number): See `segment_fuzzy()`.
    chunker (callable): A one-argument function that cuts the document into
        chunks. If this is present, it is called on the given document.
    tokenizer (callable): A one-argument function that tokenizes each chunk.
    flatten_chunks (bool): If True, undo the effect of the chunker by
        chaining the chunks in each segment, so that each segment consists
        of tokens. This can also be a one-argument function in order to
        customize the un-chunking.

Example:
    >>> list(segment([['This', 'test', 'is', 'very', 'clear'],
    ...               ['and', 'contains', 'chunks']], 2)) # doctest: +NORMALIZE_WHITESPACE
    [[['This', 'test']],
     [['is', 'very']],
     [['clear'], ['and']],
     [['contains', 'chunks']]]
"""
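# What `flatten_chunks` is described as doing, chaining the chunks of each
# segment into one flat token list, can be illustrated with
# `itertools.chain`. This is a sketch of the idea only; the variable names
# are made up and the wrapper's actual code is not shown in this file excerpt.

```python
from itertools import chain

# Segments as yielded in the doctest above: lists of chunks of tokens.
segments = [
    [['This', 'test']],
    [['is', 'very']],
    [['clear'], ['and']],
    [['contains', 'chunks']],
]

# Chain the chunks of each segment so that a segment is a flat token list.
flat_segments = [list(chain.from_iterable(segment)) for segment in segments]
```

# The third segment shows the effect: the chunk boundary between 'clear' and
# 'and' disappears once the chunks are chained.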
"""Removes features using feature list.
Description:
    With this function you can remove features from preprocessed files.
    Pass a list of features.
    Use the function `tokenize()` to tokenize your files.

Args:
    doc_token_list (Union[list[str], str]): List of all documents in the
        corpus and their tokens.
    features_to_be_removed (list[str]): List of features that should be
        removed.

Yields:
    Cleaned token array.

Todo:

Example:
    >>> doc_tokens = [['short', 'example', 'example', 'text', 'text']]
    >>> features_to_be_removed = ['example']
    >>> test = remove_features_from_file(doc_tokens, features_to_be_removed)
    >>> list(test)
    [['short', 'text', 'text']]
"""
#log.info("Removing features ...")
doc_token_array = np.array(doc_token_list)
feature_array = np.array(features_to_be_removed)
# get indices of features that should be deleted
doc_token_array = np.delete(doc_token_array, indices)
"""Creates files for mallet import.
Description:
    With this function you can create preprocessed plain text files for
    import into MALLET. Pass the cleaned tokens of each document together
    with the matching document labels.
    Use the function `remove_features_from_file()` to create a list of tokens
    per document.

Args:
    doc_tokens_cleaned (Union[list[str], str]): List of tokens per document.
    doc_labels (list[str]): List of document labels.

Todo:

Example:
    >>> doc_labels = ['examplefile']
    >>> doc_tokens_cleaned = [['short', 'example', 'text']]
    >>> create_mallet_import(doc_tokens_cleaned, doc_labels)
    >>> outpath = os.path.join('tutorial_supplementals', 'mallet_input')
    >>> os.path.isfile(os.path.join(outpath, 'examplefile.txt'))
    True
"""
#log.info("Generating mallet input files ...")
"""Creates a document-term matrix
Description:
    With this function you can create a document-term matrix
    where rows correspond to documents in the collection and columns
    correspond to terms.
    Use the function `tokenize()` to tokenize your text files and
    the function `_wordcounts()` to generate the word counts.

Args:
    doc_labels (list[str]): List of doc labels as string.
    tokens (list): List of tokens.

Returns:
    DataFrame.

Example:
    >>> example = create_doc_term_matrix('example', 'label')
    >>> print(isinstance(example, pd.DataFrame))
    True
"""
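# The row/column layout described above (rows are documents, columns are
# terms, values are frequencies) can be sketched without pandas. The function
# name `document_term_counts` and the list-of-lists result are assumptions
# made for this illustration; the module itself returns a DataFrame.

```python
from collections import Counter


def document_term_counts(tokenized_corpus, document_labels):
    """Return (types, matrix); matrix rows follow the order of document_labels."""
    counts = {
        label: Counter(tokens)
        for label, tokens in zip(document_labels, tokenized_corpus)
    }
    # Columns: every distinct token (type) in the corpus, sorted for stability.
    types = sorted(set().union(*counts.values()))
    # A Counter returns 0 for types that do not occur in a document.
    matrix = [[counts[label][type_] for type_ in types] for label in document_labels]
    return types, matrix


types, matrix = document_term_counts(
    [['test', 'corpus'], ['for', 'testing', 'testing']], ['doc1', 'doc2'])
```

# Most cells of such a matrix are zero for realistic corpora, which is the
# motivation for the sparse, MultiIndex-based variant documented further down.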
"""Creates a Series with wordcounts
Description:
    Only the function `create_doc_term_matrix()` uses this private function.

Args:
    doc (list[tokens]): List of tokens.
    label (String): String with document_label.

Returns:
    Pandas Series.

ToDo:
    Complete documentation
Example:
"""
"""Creates a dictionary of unique tokens with identifier.
Description: With this function you can create a dictionary of unique tokens as key and an identifier as value. Use the function `tokenize()` to tokenize your text files.
Args: tokens (list): List of tokens.
Returns: Dictionary.
Example:
    >>> create_dictionary(['example'])
    {'example': 1}
"""
if all(isinstance(element, list) for element in tokens):
    tokens = {token for element in tokens for token in element}
"""Creates a dictionary of dictionaries.
Description:
    Only the function `create_sparse_bow()` uses this private function to
    create a dictionary of dictionaries. The first level maps each document
    label to a dictionary of counts. The second level maps each token ID to
    the number of times that token occurs in the document.

Args:
    doc_labels (list): List of doc labels.
    doc_tokens (list): List of tokens.
    type_dictionary (dict): Dictionary of {token: id}.

Returns:
    Dictionary of dictionaries.

Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'example', 'example', 'text', 'text']]
    >>> type_dictionary = {'short': 1, 'example': 2, 'text': 3}
    >>> isinstance(_create_large_counter(doc_labels, doc_tokens, type_dictionary), defaultdict)
    True
"""
[type_dictionary[token] for token in tokens])
"""Creates a sparse index for pandas DataFrame.
Description: Only the function `create_sparse_bow()` uses this private function to create a pandas multiindex out of tuples. The multiindex represents document ID to token IDs relations.
Args: largecounter (dict): Dictionary of {document: {token: frequency}}.
Returns: Pandas MultiIndex.
Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'example', 'example', 'text', 'text']]
    >>> type_dictionary = {'short': 1, 'example': 2, 'text': 3}
    >>> largecounter = _create_large_counter(doc_labels, doc_tokens, type_dictionary)
    >>> isinstance(_create_sparse_index(largecounter), pd.MultiIndex)
    True
"""
if len(largecounter[key]) == 0:
    tuples.append((key, 0))
tuples, names=['doc_id', 'token_id'])
"""Creates sparse matrix for bag-of-words model.
Description:
    This function creates a sparse DataFrame ('bow' means `bag-of-words`_)
    with document and type identifiers as a MultiIndex and, as values, the
    frequency of each type in each document.
    It is also the main function that incorporates the private functions
    `_create_large_counter()` and `_create_sparse_index()`.
    Use the function `get_labels()` for `doc_labels`, `tokenize()` for
    `doc_tokens`, and `create_dictionary()` for `type_dictionary` as well as
    for `doc_ids`.

    .. _bag-of-words: https://en.wikipedia.org/wiki/Bag-of-words_model

Args:
    doc_labels (list[str]): List of doc labels as string.
    doc_tokens (list[str]): List of tokens as string.
    type_dictionary (dict[str]): Dictionary with {token: id}.
    doc_ids (dict[str]): Dictionary with {document label: id}.

Returns:
    Multiindexed Pandas DataFrame.

ToDo:
    * Test if it's necessary to build sparse_df_filled with int8 zeroes
      instead of int64.
    * Avoid saving sparse bow as .mm file to ingest into gensim.

Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'example', 'text']]
    >>> type_dictionary = {'short': 1, 'example': 2, 'text': 3}
    >>> doc_ids = {'exampletext': 1}
    >>> len(create_sparse_bow(doc_labels, doc_tokens, type_dictionary, doc_ids))
    3
"""
doc_labels, doc_tokens, type_dictionary)
np.zeros((len(sparse_index), 1), dtype=int), index=sparse_index)
sparse_index.get_level_values('doc_id'))
(doc_id, token_id), 0, int(largecounter[doc_id][token_id]))
"""Saves sparse matrix for bag-of-words model.
Description:
    With this function you can save the sparse matrix as `.mm file`_.

    .. _.mm file: http://math.nist.gov/MatrixMarket/formats.html#MMformat

Args:
    sparse_bow (DataFrame): DataFrame with term and term frequency by document.
    output (str): Path to output file without extension, e.g. /tmp/sparsebow.

Returns:
    None.

Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'example', 'text']]
    >>> type_dictionary = {'short': 1, 'example': 2, 'text': 3}
    >>> doc_ids = {'exampletext': 1}
    >>> sparse_bow = create_sparse_bow(doc_labels, doc_tokens, type_dictionary, doc_ids)
    >>> save_sparse_bow(sparse_bow, 'sparsebow')
    >>> import os.path
    >>> os.path.isfile('sparsebow.mm')
    True
"""
" " + str(sum_counts) + "\n"
f.write("%%MatrixMarket matrix coordinate real general\n")
f.write(header_string)
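# For orientation, a Matrix Market coordinate file consists of the banner
# line written above, a size line ("rows columns entries") and one
# "row column value" line per stored entry. The sketch below writes such a
# file from a nested dictionary of counts. The function name
# `write_matrix_market`, its arguments and the nested-dictionary input are
# assumptions for illustration; this is not the body of `save_sparse_bow()`.

```python
import os
import tempfile


def write_matrix_market(counts, num_types, path):
    """Write {doc_id: {type_id: frequency}} as a Matrix Market coordinate file."""
    entries = [
        (doc_id, type_id, frequency)
        for doc_id, row in sorted(counts.items())
        for type_id, frequency in sorted(row.items())
    ]
    num_docs = max(counts, default=0)  # assumes document IDs run from 1 to n
    with open(path, "w", encoding="utf-8") as f:
        f.write("%%MatrixMarket matrix coordinate real general\n")
        # Size line: number of rows, number of columns, number of entries.
        f.write(f"{num_docs} {num_types} {len(entries)}\n")
        for doc_id, type_id, frequency in entries:
            f.write(f"{doc_id} {type_id} {frequency}\n")


counts = {1: {1: 1, 2: 1, 3: 1}}
path = os.path.join(tempfile.mkdtemp(), "sparsebow.mm")
write_matrix_market(counts, num_types=3, path=path)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

# Indices in the coordinate format are 1-based, which matches the
# identifiers used in the doctests of this module.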
"""Creates a stopword list.
Description:
    With this function you can determine the most frequent words, also known
    as stopwords. First, you have to translate your corpus into the
    bag-of-words model using the function `create_sparse_bow()` and create a
    dictionary containing types and identifiers using `create_dictionary()`.

Args:
    sparse_bow (DataFrame): DataFrame with term and term frequency by document.
    id_types (dict[str]): Dictionary with {token: id}.
    mfw (int): Target size of most frequent words to be considered.

Returns:
    Most frequent words in a list.

Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'short', 'example', 'text']]
    >>> id_types = {'short': 1, 'example': 2, 'text': 3}
    >>> doc_ids = {'exampletext': 1}
    >>> sparse_bow = create_sparse_bow(doc_labels, doc_tokens, id_types, doc_ids)
    >>> find_stopwords(sparse_bow, 1, id_types)
    ['short']
"""
df.index.get_level_values('token_id')).sum()
for key in sparse_bow_stopwords.index.get_level_values('token_id')]
else:
"""Creates a list with hapax legomena.

Description:
    With this function you can determine hapax legomena for each document.
    First, you have to translate your corpus into the bag-of-words model
    using the function `create_sparse_bow()` and create a dictionary
    containing types and identifiers using `create_dictionary()`.

Args:
    sparse_bow (DataFrame): DataFrame with term and term frequency by document.
    id_types (dict[str]): Dictionary with {token: id}.

Returns:
    Hapax legomena in a list.

Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'example', 'example', 'text', 'text']]
    >>> id_types = {'short': 1, 'example': 2, 'text': 3}
    >>> doc_ids = {'exampletext': 1}
    >>> sparse_bow = create_sparse_bow(doc_labels, doc_tokens, id_types, doc_ids)
    >>> find_hapax(sparse_bow, id_types)
    ['short']
"""
df.index.get_level_values('token_id')).sum()
for key in sparse_bow_hapax.index.get_level_values('token_id')]
else:
#return df.loc[:,(df.isin([1])).any()].columns.tolist()
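# Both feature lists boil down to summing frequencies per type and then
# either taking the top of the ranking or keeping the types that occur once.
# A standard-library sketch of that idea follows; the helper names and the
# choice to count over the whole corpus are assumptions, and the functions
# above work on the sparse DataFrame instead.

```python
from collections import Counter


def corpus_frequencies(tokenized_corpus):
    """Sum token frequencies over all documents of the corpus."""
    totals = Counter()
    for tokenized_document in tokenized_corpus:
        totals.update(tokenized_document)
    return totals


def most_frequent_words(tokenized_corpus, mfw):
    """Return the `mfw` most frequent types."""
    return [type_ for type_, _ in corpus_frequencies(tokenized_corpus).most_common(mfw)]


def hapax_legomena(tokenized_corpus):
    """Return the types that occur exactly once in the corpus."""
    return [type_ for type_, count in corpus_frequencies(tokenized_corpus).items()
            if count == 1]


stopwords = most_frequent_words([['short', 'short', 'example', 'text']], 1)
hapax = hapax_legomena([['short', 'example', 'example', 'text', 'text']])
```

# Both results reproduce the doctests above (['short'] in each case) on the
# same toy documents.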
"""Removes features based on a list of words (types).
Description: With this function you can clean your corpus from stopwords and hapax legomena. First, you have to translate your corpus into the bag-of-words model using the function `create_sparse_bow()` and create a dictionary containing types and identifier using `create_dictionary()`. Use the functions `find_stopwords()` and `find_hapax()` to generate a feature list.
Args:
    sparse_bow (DataFrame): DataFrame with term and term frequency by document.
    features (Union[set, list]): Set or list containing features to remove.
    (not included) features (str): Text as iterable.
Returns: Clean corpus.
ToDo: * Adapt function to work with mm-corpus format.
Example:
    >>> doc_labels = ['exampletext']
    >>> doc_tokens = [['short', 'example', 'example', 'text', 'text']]
    >>> id_types = {'short': 1, 'example': 2, 'text': 3}
    >>> doc_ids = {'exampletext': 1}
    >>> sparse_bow = create_sparse_bow(doc_labels, doc_tokens, id_types, doc_ids)
    >>> features = ['short']
    >>> len(remove_features(sparse_bow, features, id_types))
    2
"""
else:
"""Creates doc2bow_list for gensim.
Description: With this function you can create a doc2bow_list as input for the gensim function `get_document_topics()` to show topics for each document.
Args: sparse_bow (DataFrame): DataFrame with term and term frequency by document.
Returns: List of lists containing tuples.
Example:
    >>> doc_labels = ['exampletext1', 'exampletext2']
    >>> doc_tokens = [['test', 'corpus'], ['for', 'testing']]
    >>> type_dictionary = {'test': 1, 'corpus': 2, 'for': 3, 'testing': 4}
    >>> doc_dictionary = {'exampletext1': 1, 'exampletext2': 2}
    >>> sparse_bow = create_sparse_bow(doc_labels, doc_tokens, type_dictionary, doc_dictionary)
    >>> from gensim.models import LdaModel
    >>> from gensim.corpora import Dictionary
    >>> corpus = [['test', 'corpus'], ['for', 'testing']]
    >>> dictionary = Dictionary(corpus)
    >>> documents = [dictionary.doc2bow(document) for document in corpus]
    >>> model = LdaModel(corpus=documents, id2word=dictionary, iterations=1, passes=1, num_topics=1)
    >>> make_doc2bow_list(sparse_bow)
    [[(1, 1), (2, 1)], [(3, 1), (4, 1)]]
"""
sparse_bow.loc[doc].index, sparse_bow.loc[doc][0])]
return doc2bow_list
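# The target structure, one list of (type ID, frequency) tuples per document,
# can be sketched from a nested dictionary of counts. The function name
# `doc2bow_from_counts` and the dictionary input are assumptions for this
# illustration; `make_doc2bow_list()` above reads the same information from
# the MultiIndex DataFrame.

```python
def doc2bow_from_counts(counts):
    """Turn {doc_id: {type_id: frequency}} into a list of lists of tuples."""
    return [
        sorted(row.items())  # (type_id, frequency) tuples, ordered by type ID
        for _, row in sorted(counts.items())  # documents ordered by document ID
    ]


counts = {1: {1: 1, 2: 1}, 2: {3: 1, 4: 1}}
doc2bow_list = doc2bow_from_counts(counts)
```

# The result has the same shape as the doctest output above:
# [[(1, 1), (2, 1)], [(3, 1), (4, 1)]].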
"""Converts lda output to a DataFrame
Description:
    With this function you can convert lda output to a DataFrame,
    a more convenient data structure.

Note:

Args:
    model: LDA model.
    vocab (list[str]): List of strings containing corpus vocabulary.
    num_keys (int): Number of top keywords per topic.

Returns:
    DataFrame.

Example:
    >>> import lda
    >>> corpus = [['test', 'corpus'], ['for', 'testing']]
    >>> doc_term_matrix = create_doc_term_matrix(corpus, ['doc1', 'doc2'])
    >>> vocab = doc_term_matrix.columns
    >>> model = lda.LDA(n_topics=1, n_iter=1)
    >>> model.fit(doc_term_matrix.values.astype(int))
    >>> df = lda2dataframe(model, vocab, num_keys=1)
    >>> len(df) == 1
    True
"""
"""Converts gensim output to DataFrame.
Description:
    With this function you can convert gensim output (usually a list of
    tuples) to a DataFrame, a more convenient data structure.

Args:
    model: Gensim LDA model.
    num_keys (int): Number of top keywords per topic.

Returns:
    DataFrame.

ToDo:

Example:
    >>> from gensim.models import LdaModel
    >>> from gensim.corpora import Dictionary
    >>> corpus = [['test', 'corpus'], ['for', 'testing']]
    >>> dictionary = Dictionary(corpus)
    >>> documents = [dictionary.doc2bow(document) for document in corpus]
    >>> model = LdaModel(corpus=documents, id2word=dictionary, iterations=1, passes=1, num_topics=1)
    >>> isinstance(gensim2dataframe(model, 4), pd.DataFrame)
    True
"""
columns=range(num_keys))
topics = model.show_topics(num_topics=model.num_topics, formatted=False)
for topic, values in topics:
"""Creates a doc_topic_matrix for lda output.
Description:
    With this function you can convert lda output to a DataFrame,
    a more convenient data structure.
    Use `lda2dataframe()` to get topics.

Note:

Args:
    model: LDA model.
    topics: DataFrame.
    doc_labels (list[str]): List of doc labels as string.

Returns:
    DataFrame.

Example:
    >>> import lda
    >>> corpus = [['test', 'corpus'], ['for', 'testing']]
    >>> doc_term_matrix = create_doc_term_matrix(corpus, ['doc1', 'doc2'])
    >>> vocab = doc_term_matrix.columns
    >>> model = lda.LDA(n_topics=1, n_iter=1)
    >>> model.fit(doc_term_matrix.values.astype(int))
    >>> topics = lda2dataframe(model, vocab)
    >>> doc_topic = lda_doc_topic(model, topics, ['doc1', 'doc2'])
    >>> len(doc_topic.T) == 2
    True
"""