Recherche improbable d’une homogène diversité : Le débat sur l’identité nationale

Pierre Ratinaud; Pascal Marchand

Langages n° 187 (3/2012)

L'analyse du corpus face à l'hétérogénéité des données

Publication date

October 2012

EAN

9782200927769

Price per issue

18 €

Imprint

Armand Colin

Length

136 pages

Recherche improbable d’une homogène diversité : Le débat sur l’identité nationale

Summary

In this article, we compare the effects of two methods of morphological correction of a web-derived corpus on ALCESTE-type classifications produced with the IRAMUTEQ software. Starting from the 18,240 contributions to the debate on national identity, we compare the initial corpus with a manually corrected corpus and with a corpus corrected by a semi-automatic method that relies on a particular use of the Hunspell spell checker. The three resulting corpora (initial, automatic and manual) are each submitted to two descending hierarchical classifications: one retains the 1,500 most frequent full (content-word) forms, the other the 3,000 most frequent. The pairwise comparison of the results obtained on each corpus shows that the automatic correction we propose comes significantly closer to a manual correction.

Key words

morphological correction

automatic correction

heterogeneity

homogeneity

ALCESTE classification

Improbable search of a homogeneous diversity : The debate on national identity

Abstract

In this paper, we compare the effects of two methods of morphological correction of a corpus coming from the web on ALCESTE analyses made with the IRAMUTEQ software. From the 18,240 contributions to the debate on national identity, we compare the initial corpus with a manually corrected one and with one corrected by a semi-automatic method based on a particular use of the Hunspell corrector. The three corpora (initial, automatic and manual) are used in two different hierarchical clusterings: one that retains the 1,500 most frequent words and one that retains the 3,000 most frequent words. The pairwise comparison of the results obtained on each corpus shows that the automatic correction we propose makes it possible to come significantly closer to a manual one.

Keywords

morphological correction

automatic correction

heterogeneity

homogeneity

ALCESTE clustering
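
The abstract describes a pairwise comparison of the classifications obtained on the three corpora, but it does not say which agreement measure is used. As a purely illustrative sketch, the Python below uses the Rand index (an assumption, not the authors' stated method) to compare the class assignments of the same text segments across corpora. The function names `rand_index` and `pairwise_agreement` are hypothetical, and the class labels in the usage lines are toy values, not results from the article.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Share of item pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two items
    in the same class, or both put them in different classes.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one item: nothing to disagree about
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


def pairwise_agreement(partitions):
    """Rand index for every pair of named partitions (names sorted)."""
    return {
        (a, b): rand_index(partitions[a], partitions[b])
        for a, b in combinations(sorted(partitions), 2)
    }


# Toy class labels for six segments in each corpus (illustrative only).
toy = {
    "initial": [0, 0, 1, 1, 2, 2],
    "automatic": [0, 0, 1, 2, 2, 2],
    "manual": [1, 1, 0, 2, 2, 2],
}
for pair, score in pairwise_agreement(toy).items():
    print(pair, round(score, 3))
```

The Rand index ignores how classes are numbered, which matters here because two independent classifications have no reason to number equivalent classes identically. It does require the same segments to be present in both partitions.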

Citation

Pierre Ratinaud, Pascal Marchand, « Recherche improbable d’une homogène diversité : Le débat sur l’identité nationale », Langages n° 187 (3/2012), pp. 93-107, Armand Colin. Available at: http://www.revues.armand-colin.com/lettres-langues/langages/langages-ndeg-187-32012-lanalyse-du-corpus-face-lheterogeneite-donnees/recherche-improbable-dune-homogene-diversite-debat
