Skip to content

Commit

Permalink
Update geneteurs notebook, add anacycliques notebook
Browse files Browse the repository at this point in the history
  • Loading branch information
clement-plancq committed Nov 30, 2016
1 parent da5c727 commit 609872e
Show file tree
Hide file tree
Showing 2 changed files with 424 additions and 43 deletions.
377 changes: 377 additions & 0 deletions anacycliques.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,377 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Anacycliques\n",
"\n",
"Trouver tous les anacycliques dans une liste de mots (issue de lexique.org par exemple). Les palindromes sont des cas particuliers d'anacycliques. \n",
"Ex : 'été' est un palidrome \n",
"'vu' et 'uv', 'tort' et 'trot' sont des anacycliques\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tester si deux mots sont des anacycliques est très simple en Python. Pour cela on peut utiliser les slices avec le troisième argument pour le pas. \n",
"`word[::-1]` : un pas de `-1` permet d'inverser la chaîne de caractères."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def is_anacyclique(word1, word2):\n",
" \"\"\"\n",
" Returns True if the given words are palindromes\n",
" False if not\n",
" \"\"\"\n",
" if word1 == word2[::-1]:\n",
" return True\n",
" else:\n",
" return False"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"is_anacyclique(\"tort\", \"trot\")"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nb de mots : 142641\n"
]
}
],
"source": [
"words = []\n",
"with open('lexique381.ortho', 'r') as f:\n",
" for line in f:\n",
" line = line.rstrip()\n",
" if len(line) != 1:\n",
" words.append(line)\n",
"print(\"Nb de mots : {}\".format(len(words)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tester l'appartenance d'un élément à une liste de 142641 éléments est couteux en temps de traitement. \n",
"Le temps du d'appartenance d'un élément à une liste est de 0(n) (voir [TimeComplexity](https://wiki.python.org/moin/TimeComplexity) sur le wiki Python). \n",
"Ici comme la liste est déjà pas mal grande et que l'opération est répétée n fois, le temps de traitement est rhédibitoire."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Puisqu'on aime se compliquer la vie on peut essayer de réduite la taille de la liste en se limitant aux mots de même taille. Les anacycliques sont forcément de même taille."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def words_by_length(words):\n",
" \"\"\"\n",
" Receives a list of words sorted by length, returns a list of words for each length\n",
" \"\"\"\n",
" for length in range(len(words[0]), len(words[-1]) + 1):\n",
" yield [word for word in words if len(word) == length]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"res = set()\n",
"sorted_words = sorted(words, key = lambda x:len(x))\n",
"for words_n in words_by_length(sorted_words):\n",
" for word in words_n:\n",
" acyclique = word[::-1]\n",
" if acyclique in words_n:\n",
" res.add(tuple([word, acyclique]))\n",
" \n",
"for match in sorted(res):\n",
" print(match)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Malgré le découpage du lexique en mots de même taille et l'utilisation des générateurs, le temps de traitement reste important. \n",
"Essayons avec un dictionnaire."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nb de mots : 125627\n"
]
}
],
"source": [
"from collections import defaultdict\n",
"words_d = defaultdict()\n",
"with open('lexique381.ortho', 'r') as f:\n",
" for line in f:\n",
" line = line.rstrip()\n",
" if len(line) != 1:\n",
" words_d[line] = \"\"\n",
"print(\"Nb de mots : {}\".format(len(words_d.keys())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On a une différence de taille d'avec la liste parce que le dictionnaire supprime les doublons (125627 contre 142641). Mais ça ne change pas vraiment l'ordre du volume des données."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('aa', 'aa')\n",
"('ada', 'ada')\n",
"('ados', 'soda')\n",
"('adulé', 'éluda')\n",
"('aga', 'aga')\n",
"('agas', 'saga')\n",
"('ah', 'ha')\n",
"('ail', 'lia')\n",
"('ailé', 'élia')\n",
"('air', 'ria')\n",
"('ako', 'oka')\n",
"('alla', 'alla')\n",
"('alloc', 'colla')\n",
"('amis', 'sima')\n",
"('an', 'na')\n",
"('ana', 'ana')\n",
"('anas', 'sana')\n",
"('angor', 'rogna')\n",
"('annoté', 'étonna')\n",
"('ara', 'ara')\n"
]
}
],
"source": [
"res = set()\n",
"for word in words_d.keys():\n",
" acyclique = word[::-1]\n",
" if acyclique in words:\n",
" res.add(tuple([word, acyclique]))\n",
" \n",
"for match in sorted(res)[:20]:\n",
" print(match)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Là c'est tout de suite BEAUCOUP plus rapide <small>(supprimez `[:20]` pour avoir tous les résultats)</small>. Mais pourquoi ? \n",
"Parce que le test d'appartenance pour un dictionnaire est estimé à O(1) en moyenne (O(n) dans le pire des cas). \n",
"idem pour les ensembles (`set`)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nb de mots : 125627\n"
]
}
],
"source": [
"words_s = set()\n",
"with open('lexique381.ortho', 'r') as f:\n",
" for line in f:\n",
" line = line.rstrip()\n",
" if len(line) != 1:\n",
" words_s.add(line)\n",
"print(\"Nb de mots : {}\".format(len(words_s)))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('aa', 'aa')\n",
"('ada', 'ada')\n",
"('ados', 'soda')\n",
"('adulé', 'éluda')\n",
"('aga', 'aga')\n",
"('agas', 'saga')\n",
"('ah', 'ha')\n",
"('ail', 'lia')\n",
"('ailé', 'élia')\n",
"('air', 'ria')\n",
"('ako', 'oka')\n",
"('alla', 'alla')\n",
"('alloc', 'colla')\n",
"('amis', 'sima')\n",
"('an', 'na')\n",
"('ana', 'ana')\n",
"('anas', 'sana')\n",
"('angor', 'rogna')\n",
"('annoté', 'étonna')\n",
"('ara', 'ara')\n"
]
}
],
"source": [
"res = set()\n",
"for word in words_s:\n",
" acyclique = word[::-1]\n",
" if acyclique in words:\n",
" res.add(tuple([word, acyclique]))\n",
" \n",
"for match in sorted(res)[:20]:\n",
" print(match)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Allez on va s'amuser à mesurer ça :"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n",
"Wall time: 3.02 ms\n",
"CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
"Wall time: 7.39 µs\n",
"CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
"Wall time: 8.82 µs\n"
]
},
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time \"palindrome\" in words\n",
"%time \"palindrome\" in words_d\n",
"%time \"palindrome\" in words_s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Si vous êtes intéressés par l'implémentation des dictionnaires vous pourrez lire l'article suivant : [Python dictionary implementation] (http://www.laurentluce.com/posts/python-dictionary-implementation/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2+"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Loading

0 comments on commit 609872e

Please sign in to comment.