diff --git a/anacycliques.ipynb b/anacycliques.ipynb
new file mode 100644
index 0000000..781334a
--- /dev/null
+++ b/anacycliques.ipynb
@@ -0,0 +1,377 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Anacycliques\n",
+    "\n",
+    "Find every anacyclique (a word that spells another word when read backwards) in a word list (taken from lexique.org, for example). Palindromes are a special case of anacycliques.  \n",
+    "E.g. 'été' is a palindrome  \n",
+    "'vu' and 'uv', 'tort' and 'trot' are anacycliques\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Testing whether two words are anacycliques is very simple in Python: slices accept a third argument, the step.  \n",
+    "`word[::-1]`: a step of `-1` reverses the string."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def is_anacyclique(word1, word2):\n",
+    "    \"\"\"\n",
+    "    Returns True if the given words are anacycliques (each one is the other read backwards)\n",
+    "    False if not\n",
+    "    \"\"\"\n",
+    "    if word1 == word2[::-1]:\n",
+    "        return True\n",
+    "    else:\n",
+    "        return False"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "is_anacyclique(\"tort\", \"trot\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Nb de mots : 142641\n"
+     ]
+    }
+   ],
+   "source": [
+    "words = []\n",
+    "with open('lexique381.ortho', 'r') as f:\n",
+    "    for line in f:\n",
+    "        line = line.rstrip()\n",
+    "        if len(line) != 1:\n",
+    "            words.append(line)\n",
+    "print(\"Nb de mots : {}\".format(len(words)))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Testing whether an element belongs to a list of 142641 elements is expensive.  \n",
+    "A membership test on a list takes O(n) time (see [TimeComplexity](https://wiki.python.org/moin/TimeComplexity) on the Python wiki).  \n",
+    "Here the list is already fairly large and the test is repeated n times, so the total cost grows as O(n²) and the processing time is prohibitive."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since we like to make life complicated, we can try to shrink the list by only looking at words of the same length: anacycliques necessarily have the same length."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def words_by_length(words):\n",
+    "    \"\"\"\n",
+    "    Receives a list of words sorted by length, yields a list of words for each length\n",
+    "    \"\"\"\n",
+    "    for length in range(len(words[0]), len(words[-1]) + 1):\n",
+    "        yield [word for word in words if len(word) == length]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "res = set()\n",
+    "sorted_words = sorted(words, key=len)\n",
+    "for words_n in words_by_length(sorted_words):\n",
+    "    for word in words_n:\n",
+    "        acyclique = word[::-1]\n",
+    "        if acyclique in words_n:\n",
+    "            res.add(tuple([word, acyclique]))\n",
+    "    \n",
+    "for match in sorted(res):\n",
+    "    print(match)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "Even after splitting the lexicon into groups of same-length words and using generators, processing is still slow.  \n",
+    "Let's try with a dictionary."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Nb de mots : 125627\n"
+     ]
+    }
+   ],
+   "source": [
+    "from collections import defaultdict\n",
+    "words_d = defaultdict()\n",
+    "with open('lexique381.ortho', 'r') as f:\n",
+    "    for line in f:\n",
+    "        line = line.rstrip()\n",
+    "        if len(line) != 1:\n",
+    "            words_d[line] = \"\"\n",
+    "print(\"Nb de mots : {}\".format(len(words_d.keys())))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The size differs from the list's because the dictionary drops duplicates (125627 versus 142641), but the order of magnitude of the data stays the same."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "('aa', 'aa')\n",
+      "('ada', 'ada')\n",
+      "('ados', 'soda')\n",
+      "('adulé', 'éluda')\n",
+      "('aga', 'aga')\n",
+      "('agas', 'saga')\n",
+      "('ah', 'ha')\n",
+      "('ail', 'lia')\n",
+      "('ailé', 'élia')\n",
+      "('air', 'ria')\n",
+      "('ako', 'oka')\n",
+      "('alla', 'alla')\n",
+      "('alloc', 'colla')\n",
+      "('amis', 'sima')\n",
+      "('an', 'na')\n",
+      "('ana', 'ana')\n",
+      "('anas', 'sana')\n",
+      "('angor', 'rogna')\n",
+      "('annoté', 'étonna')\n",
+      "('ara', 'ara')\n"
+     ]
+    }
+   ],
+   "source": [
+    "res = set()\n",
+    "for word in words_d.keys():\n",
+    "    acyclique = word[::-1]\n",
+    "    if acyclique in words_d:\n",
+    "        res.add(tuple([word, acyclique]))\n",
+    "    \n",
+    "for match in sorted(res)[:20]:\n",
+    "    print(match)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now it is suddenly MUCH faster (remove `[:20]` to see all the results). But why?  \n",
+    "Because a membership test on a dictionary is O(1) on average (O(n) in the worst case).  \n",
+    "The same goes for sets (`set`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Nb de mots : 125627\n"
+     ]
+    }
+   ],
+   "source": [
+    "words_s = set()\n",
+    "with open('lexique381.ortho', 'r') as f:\n",
+    "    for line in f:\n",
+    "        line = line.rstrip()\n",
+    "        if len(line) != 1:\n",
+    "            words_s.add(line)\n",
+    "print(\"Nb de mots : {}\".format(len(words_s)))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "('aa', 'aa')\n",
+      "('ada', 'ada')\n",
+      "('ados', 'soda')\n",
+      "('adulé', 'éluda')\n",
+      "('aga', 'aga')\n",
+      "('agas', 'saga')\n",
+      "('ah', 'ha')\n",
+      "('ail', 'lia')\n",
+      "('ailé', 'élia')\n",
+      "('air', 'ria')\n",
+      "('ako', 'oka')\n",
+      "('alla', 'alla')\n",
+      "('alloc', 'colla')\n",
+      "('amis', 'sima')\n",
+      "('an', 'na')\n",
+      "('ana', 'ana')\n",
+      "('anas', 'sana')\n",
+      "('angor', 'rogna')\n",
+      "('annoté', 'étonna')\n",
+      "('ara', 'ara')\n"
+     ]
+    }
+   ],
+   "source": [
+    "res = set()\n",
+    "for word in words_s:\n",
+    "    acyclique = word[::-1]\n",
+    "    if acyclique in words_s:\n",
+    "        res.add(tuple([word, acyclique]))\n",
+    "    \n",
+    "for match in sorted(res)[:20]:\n",
+    "    print(match)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's have some fun and measure it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n",
+      "Wall time: 3.02 ms\n",
+      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
+      "Wall time: 7.39 µs\n",
+      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
+      "Wall time: 8.82 µs\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "%time \"palindrome\" in words\n",
+    "%time \"palindrome\" in words_d\n",
+    "%time \"palindrome\" in words_s"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you are interested in how dictionaries are implemented, you can read the following article: [Python dictionary implementation](http://www.laurentluce.com/posts/python-dictionary-implementation/)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.2+"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/generateur.ipynb b/generateur.ipynb
index 85ce87b..0321d14 100644
--- a/generateur.ipynb
+++ b/generateur.ipynb
@@ -23,7 +23,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 45,
    "metadata": {
     "collapsed": true
    },
@@ -43,7 +43,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 34,
    "metadata": {
     "collapsed": false
    },
@@ -73,22 +73,11 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "metadata": {
     "collapsed": false
    },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
-      "Wall time: 11.4 µs\n",
-      "CPU times: user 596 ms, sys: 4 ms, total: 600 ms\n",
-      "Wall time: 606 ms\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "%time mots_a = with_a(mots)\n",
     "mots_big = mots * 1000000\n",
@@ -108,7 +97,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 46,
    "metadata": {
     "collapsed": true
    },
@@ -125,7 +114,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 47,
    "metadata": {
     "collapsed": false
    },
@@ -134,15 +123,15 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "CPU times: user 0 ns, sys: 4 ms, total: 4 ms\n",
-      "Wall time: 7.09 ms\n",
       "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
-      "Wall time: 13.8 µs\n"
+      "Wall time: 76.8 µs\n",
+      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
+      "Wall time: 8.82 µs\n"
      ]
     }
    ],
    "source": [
-    "mots_big = mots * 1000\n",
+    "mots_big = mots * 100\n",
     "%time mots_a = with_a(mots_big)\n",
     "%time mots_a_gen = gen_with_a(mots_big)"
    ]
@@ -157,7 +146,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 48,
    "metadata": {
     "collapsed": false
    },
@@ -167,13 +156,18 @@
      "output_type": "stream",
      "text": [
       "mots_a is a <class 'list'>\n",
-      "mots_a_gen is a <class 'generator'>\n"
+      "mots_a_gen is a <class 'generator'>\n",
+      "Taille de mots_a : 1672\n",
+      "Taille de mots_a_gen : 88\n"
      ]
     }
    ],
    "source": [
     "print(\"mots_a is a {}\".format(type(mots_a)))\n",
-    "print(\"mots_a_gen is a {}\".format(type(mots_a_gen)))"
+    "print(\"mots_a_gen is a {}\".format(type(mots_a_gen)))\n",
+    "import sys\n",
+    "print(\"Taille de mots_a : {}\".format(sys.getsizeof(mots_a)))\n",
+    "print(\"Taille de mots_a_gen : {}\".format(sys.getsizeof(mots_a_gen)))"
    ]
   },
   {
@@ -188,7 +182,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 18,
    "metadata": {
     "collapsed": false
    },
@@ -197,8 +191,8 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
-      "Wall time: 568 µs\n"
+      "CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n",
+      "Wall time: 5.68 ms\n"
      ]
     }
    ],
@@ -224,7 +218,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 24,
    "metadata": {
     "collapsed": false
    },
@@ -232,16 +226,38 @@
     {
      "data": {
       "text/plain": [
-       "<generator object <genexpr> at 0x7f0eac423db0>"
+       "['chat', 'matin']"
       ]
      },
-     "execution_count": 22,
+     "execution_count": 24,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "(mot for mots in mots if 'a' in mot)"
+    "[mot for mot in mots if 'a' in mot]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "<generator object <genexpr> at 0x7fd2a44637d8>"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "(mot for mot in mots if 'a' in mot)"
    ]
   },
   {
@@ -259,18 +275,6 @@
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.5.2+"
   }
  },
  "nbformat": 4,