Title: Tools for Natural Language Processing in French
Description: Tools for Natural Language Processing in French and texts from Marcel Proust's collection "A La Recherche Du Temps Perdu". The novels contained in this collection are "Du cote de chez Swann", "A l'ombre des jeunes filles en fleurs", "Le Cote de Guermantes", "Sodome et Gomorrhe I et II", "La Prisonniere", "Albertine disparue", and "Le Temps retrouve".
Authors: Colin Fay [aut, cre]
Maintainer: Colin Fay <[email protected]>
License: MIT + file LICENSE
Version: 0.4.0
Built: 2024-11-12 02:40:05 UTC
Source: https://github.com/colinfay/proustr
A dataset containing Marcel Proust's "Albertine disparue". This text has been downloaded from WikiSource.
albertinedisparue
A tibble with text, book, volume, and year
<https://fr.wikisource.org/wiki/Albertine_disparue>
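A minimal sketch of how to inspect the dataset once the package is loaded (the same pattern applies to the other novel datasets below):
library(proustr)
# Lazy-loaded dataset: a tibble with text, book, volume and year columns
str(albertinedisparue)
head(albertinedisparue$text, 2)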
A dataset containing Marcel Proust's "À l’ombre des jeunes filles en fleurs". This text has been downloaded from WikiSource.
alombredesjeunesfillesenfleurs
A tibble with text, book, volume, and year
<https://fr.wikisource.org/wiki/À_l’ombre_des_jeunes_filles_en_fleurs>
A dataset containing Marcel Proust's "Du côté de chez Swann". This text has been downloaded from WikiSource.
ducotedechezswann
A tibble with text, book, volume, and year
<https://fr.wikisource.org/wiki/Du_côté_de_chez_Swann>
A dataset containing Marcel Proust's "La prisonnière". This text has been downloaded from WikiSource.
laprisonniere
A tibble with text, book, volume, and year
<https://fr.wikisource.org/wiki/La_Prisonnière>
A dataset containing Marcel Proust's "Le Côté de Guermantes". This text has been downloaded from WikiSource.
lecotedeguermantes
A tibble with text, book, volume, and year
<https://fr.wikisource.org/wiki/Le_Côté_de_Guermantes>
A dataset containing Marcel Proust's "Le temps retrouvé". This text has been downloaded from WikiSource.
letempretrouve
A tibble with text, book, volume, and year.
<https://fr.wikisource.org/wiki/Le_Temps_retrouvé>
Detect the name of the days (in French)
pr_detect_days(df, col)
df: a dataframe
col: the column containing the text
a tibble with the number of day names detected by the algorithm
a <- data.frame(jours = c("C'est lundi 1er mars et mardi 2",
                          "Et mercredi 3",
                          "Il est revenu jeudi."))
pr_detect_days(a, jours)
Detect the name of the months (in French)
pr_detect_months(df, col)
df: a dataframe
col: the column containing the text
a tibble with the number of month names detected by the algorithm
a <- data.frame(month = c("C'est lundi 1er mars et mardi 2",
                          "Et mercredi 3",
                          "Il est revenu en juin."))
pr_detect_months(a, month)
Detect the pronouns from a text (in French)
pr_detect_pro(df, col, verbose = FALSE)
df: a dataframe
col: the column containing the text
verbose: whether or not to return the list of pronouns. Default is FALSE.
The shortcuts in the pronoun col stand for:
pps: first person singular (première personne du singulier)
dps: second person singular (deuxième personne du singulier)
tps: third person singular (troisième personne du singulier)
ppp: first person plural (première personne du pluriel)
dpp: second person plural (deuxième personne du pluriel)
tpp: third person plural (troisième personne du pluriel)
a tibble with the detected pronouns
library(proustr)
a <- proust_books()[1,]
pr_detect_pro(a, text, verbose = TRUE)
pr_detect_pro(a, text)
Remove non-alphanumeric elements
pr_keep_only_alnum(text, replacement = " ")
text: a vector
replacement: what to replace the non-alphanumeric characters with. Default is " ".
a vector
pr_keep_only_alnum("neuilly-en-thelle")
pr_keep_only_alnum("neuilly-en-thelle")
Normalize a text written with usual French punctuation
pr_normalize_punc(df, col)
df: a dataframe
col: the column to normalize
a tibble with normalized text
a <- proustr::albertinedisparue[1:20,]
pr_normalize_punc(a, text)
Implementation of the SnowballC stemmer. Note that punctuation and capital letters are removed when processing.
pr_stem_sentences(df, col, language = "french")
df: the data.frame containing the text
col: the column with the text
language: the language of the text. Default is "french". See SnowballC::getStemLanguages() for a list of supported languages.
a tibble
a <- proustr::laprisonniere[1:10,]
pr_stem_sentences(a, text)
Implementation of the SnowballC stemmer. Note that punctuation and capital letters are also removed.
pr_stem_words(df, col, language = "french")
df: the data.frame containing the sentences
col: the column with the sentences
language: the language of the words. Default is "french". See SnowballC::getStemLanguages() for a list of supported languages.
a tibble
a <- data.frame(words = c("matin", "heure", "fatigué", "sonné", "lois", "tests", "fusionner"))
pr_stem_words(a, words)
Remove accents from a character vector
pr_unacent(text)
text: a vector
a vector
pr_unacent("du chêne")
pr_unacent("du chêne")
Returns a tidy tibble of Marcel Proust's 7 novels from À la recherche du temps perdu. The tibble contains four columns: text, book, volume and year.
proust_books()
A tibble with four columns: text, book, volume and year.
# Create the tibble
proust <- proust_books()
A dataset containing Marcel Proust's characters from "À la recherche du temps perdu" and their frequency in each book. This dataset has been downloaded from proust-personnages.
proust_char
A tibble with the characters' names and their frequency in each book
http://proust-personnages.fr/?page_id=10254
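A minimal sketch for exploring the dataset; since the exact column layout is not documented here, inspect it before relying on specific names:
library(proustr)
# Check the actual column names before using them
names(proust_char)
head(proust_char)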
Returns a tidy data frame of Marcel Proust's characters.
proust_characters()
A tibble
# Create the tibble
proust <- proust_characters()
Create your own flavor of Proust with this random extractor.
proust_random(count = 1, collapse = TRUE)
count: the number of lines you want to randomly extract and paste.
collapse: if FALSE, the output will be a tibble. Default is TRUE, which returns a character vector.
a character vector
proust_random(4)
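For comparison, the same extraction with collapse = FALSE returns a tibble instead of a single pasted character vector:
library(proustr)
proust_random(4, collapse = FALSE)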
Old sentiment lexicon. This function has been deprecated and will be removed in the next proustr version. See the rfeel package instead: http://github.com/ColinFay/rfeel
proust_sentiments(type = c("polarity", "score"))
type: For backward compatibility
a tibble
Stop words concatenated from various web sources.
proust_stopwords()
a tibble with stopwords
https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt
proust_stopwords()
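A sketch of one way the stopword list could be used to filter tokenized text. It assumes the tidytext package is installed and that the returned tibble stores the stopwords in a column named word; check names(proust_stopwords()) before joining.
library(proustr)
library(dplyr)
library(tidytext)
# Tokenize the novels and drop French stopwords
# (the join column "word" is an assumption; verify with names(proust_stopwords()))
tokens <- proust_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(proust_stopwords(), by = "word")
head(tokens)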
A dataset containing Marcel Proust's "Sodome et Gomorrhe". This text has been downloaded from WikiSource.
sodomeetgomorrhe
A tibble with text, book, volume, and year
<https://fr.wikisource.org/wiki/Sodome_et_Gomorrhe>
ISO stopwords
stop_words
A tibble
https://raw.githubusercontent.com/stopwords-iso/stopwords-iso/master/stopwords-iso.json
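A minimal sketch for inspecting the bundled table; its column layout is not documented above, so check it before use:
# Inspect the ISO stopword table shipped with the package
str(proustr::stop_words)
head(proustr::stop_words)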