From b7f53956836c44094dc88cdd61bc4d957cd23ce6 Mon Sep 17 00:00:00 2001
From: Beatriz Milz
Date: Tue, 30 Jul 2024 09:31:45 -0300
Subject: [PATCH] Built site for gh-pages

---
 .nojekyll           |  2 +-
 data-transform.html |  2 +-
 datetimes.html      | 12 ++++++------
 iteration.html      | 12 ++++++------
 search.json         |  8 ++++----
 sitemap.xml         |  2 +-
 6 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/.nojekyll b/.nojekyll
index c8e8a624a..d2937871f 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-f815185e
\ No newline at end of file
+3446940a
\ No newline at end of file
diff --git a/data-transform.html b/data-transform.html
index b27363aad..9e535cc88 100644
--- a/data-transform.html
+++ b/data-transform.html
@@ -1266,7 +1266,7 @@

  • Qual companhia aérea (companhia_aerea) possui a pior média de atrasos? Desafio: você consegue desvendar os efeitos de aeroportos ruins versus companhias aéreas ruins? Por que sim ou por que não? (Dica: experimente usar voos |> group_by(companhia_aerea, destino) |> summarize(n()))

  • Ache os vôos que estão mais atrasados no momento da decolagem, a partir de cada destino.

  • - Como os atrasaso variam ao longo do dia. Ilustre sua resposta com um gráfico.

  • + Como os atrasos variam ao longo do dia. Ilustre sua resposta com um gráfico.

  • O que acontece se você passar um n negativo para slice_min() e funções similares?

  • Explique o que count() faz em termos dos verbos dplyr que você acabou de aprender. O que o argumento sort faz para a função count()?

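The exercise above asks how `count()` relates to the dplyr verbs from this section. A minimal sketch with a hypothetical toy tibble (not the book's `voos` data): `count(x, sort = TRUE)` behaves like `group_by()` + `summarize(n = n())` followed by `arrange(desc(n))`:

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), x = 1:3)

# count() is shorthand for grouping, counting, and (with sort = TRUE) ordering:
df |> count(g, sort = TRUE)

# is equivalent to:
df |>
  group_by(g) |>
  summarize(n = n()) |>
  arrange(desc(n))
```

Both pipelines return one row per group with a column `n` holding the group size, largest groups first.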
diff --git a/datetimes.html b/datetimes.html
index 2846ff203..28758d68f 100644
--- a/datetimes.html
+++ b/datetimes.html
@@ -492,9 +492,9 @@

    today() or now():

    today()
    -#> [1] "2024-07-29"
    +#> [1] "2024-07-30"
     now()
    -#> [1] "2024-07-29 10:53:09 -03"
    +#> [1] "2024-07-30 09:30:14 -03"

    Otherwise, the following sections describe the four ways you’re likely to create a date/time:

      @@ -782,9 +782,9 @@

      as_datetime() and as_date():

      as_datetime(today())
      -#> [1] "2024-07-29 UTC"
      +#> [1] "2024-07-30 UTC"
       as_date(now())
      -#> [1] "2024-07-29"
      +#> [1] "2024-07-30"

      Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().

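The Unix-epoch paragraph in the hunk above can be illustrated with a short sketch (lubridate; printed times are in UTC):

```r
library(lubridate)

as_datetime(60 * 60 * 10)  # 36,000 seconds after 1970-01-01 00:00:00
#> [1] "1970-01-01 10:00:00 UTC"
as_date(365 * 10 + 2)      # 3,652 days after 1970-01-01 (the +2 covers 1972 and 1976, the leap years)
#> [1] "1980-01-01"
```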
      @@ -1010,12 +1010,12 @@

      # How old is Hadley?
       h_age <- today() - ymd("1979-10-14")
       h_age
      -#> Time difference of 16360 days
      +#> Time difference of 16361 days

      A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.

      as.duration(h_age)
      -#> [1] "1413504000s (~44.79 years)"
      +#> [1] "1413590400s (~44.79 years)"

      Durations come with a bunch of convenient constructors:

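The hunk above ends at "Durations come with a bunch of convenient constructors:"; the constructors it refers to are lubridate's `d*()` family. A brief sketch (outputs as lubridate prints them, always in seconds):

```r
library(lubridate)

dseconds(15)   # a duration of 15 seconds
dminutes(10)   # 600 seconds
ddays(1)       # 86,400 seconds
dyears(1)      # an "average" year of 365.25 days: 31,557,600 seconds
```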
diff --git a/iteration.html b/iteration.html
index 36eadbf01..f9ab31e73 100644
--- a/iteration.html
+++ b/iteration.html
@@ -1312,12 +1312,12 @@

 #> # Database: DuckDB v0.10.0 [root@Darwin 21.6.0:R 4.3.3/:memory:]
 #>     year     n
 #>    <dbl> <dbl>
-#> 1   1977   142
-#> 2   1987   142
-#> 3   1967   142
-#> 4   2007   142
-#> 5   1952   142
-#> 6   1962   142
+#> 1   1952   142
+#> 2   1957   142
+#> 3   1972   142
+#> 4   1977   142
+#> 5   1987   142
+#> 6   1967   142
 #> # ℹ more rows

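The reordered DuckDB rows in the hunk above are expected churn: a database backend makes no guarantee about row order, so each site rebuild may print the counts in a different order. A minimal sketch (hypothetical in-memory table; assumes the DBI, dplyr, dbplyr, and duckdb packages are installed) of pinning the order with `arrange()`:

```r
library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "gapminder", data.frame(year = c(1957, 1952, 1977)))

tbl(con, "gapminder") |>
  count(year) |>
  arrange(year) |>   # explicit ordering; without it, row order is backend-dependent
  collect()

dbDisconnect(con, shutdown = TRUE)
```

With an explicit `arrange()`, the rendered output is deterministic and the diff noise above would disappear.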
      diff --git a/search.json b/search.json index b0817db09..ad421ed80 100644 --- a/search.json +++ b/search.json @@ -355,7 +355,7 @@ "href": "data-transform.html#grupos-groups", "title": "3  ✅ Transformação de dados", "section": "\n3.5 Grupos (groups)", - "text": "3.5 Grupos (groups)\nAté aqui você aprendeu sobre funções que trabalham com linhas e colunas. dplyr se torna ainda mais poderoso quando você acrescenta a habilidade de trabalhar com grupos. Nessa seção, focaremos nas funções mais importantes: group_by(), summarize(), e a família de funções slice.\n\n3.5.1 group_by()\n\nUtilize group_by() para dividir o seu conjunto de dados em grupos que tenham algum significado para sua análise:\n\nvoos |> \n group_by(mes)\n#> # A tibble: 336,776 × 19\n#> # Groups: mes [12]\n#> ano mes dia horario_saida saida_programada atraso_saida\n#> <int> <int> <int> <int> <int> <dbl>\n#> 1 2013 1 1 517 515 2\n#> 2 2013 1 1 533 529 4\n#> 3 2013 1 1 542 540 2\n#> 4 2013 1 1 544 545 -1\n#> 5 2013 1 1 554 600 -6\n#> 6 2013 1 1 554 558 -4\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\ngroup_by() não modifica seus dados mas, se você observar a saída de perto, você verá que a há uma indicação de que ela está agrupada (“grouped by”) mês (Groups: mes [12]). Isso significa que todas as operações subsequentes irão operar “por mês”. 
group_by() adiciona essa funcionalidade agrupada (“grouped”) (referenciada como uma classe) ao data frame, que modifica o comportamento dos verbos subsequentes que são aplicados aos dados.\n\n3.5.2 summarize()\n\nA operação agrupada mais importante é a sumarização (“summary”), que caso esteja sendo utilizada para calcular uma única sumarização estatística, reduz o data frame para ter apenas uma única linha para cada grupo.\nNo dplyr, essa operação é feita pela função summarize()5, como mostrado pelo exemplo a seguir, que calcula o atraso médio das decolagens por mês:\n\nvoos |> \n group_by(mes) |> \n summarize(\n atraso_medio = mean(atraso_saida)\n )\n#> # A tibble: 12 × 2\n#> mes atraso_medio\n#> <int> <dbl>\n#> 1 1 NA\n#> 2 2 NA\n#> 3 3 NA\n#> 4 4 NA\n#> 5 5 NA\n#> 6 6 NA\n#> # ℹ 6 more rows\n\nUow! Alguma coisa deu errado e todos os nosso resultados viraram NAs (pronuncia-se “N-A”), que é o símbolo do R para valores ausentes. Isso ocorreu pois alguns dos vôos observados possuíam dados ausentes na coluna de atrasos, e por isso, quando calculamos a média incluindo esses valores, obtemos um NA como resultado. Voltaremos a falar de valores ausentes em detalhes em Capítulo 18, mas por enquanto, vamos pedir à função mean() para ignorar todos os valores ausentes definindo o argumento na.rm como TRUE:\n\nvoos |> \n group_by(mes) |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE)\n )\n#> # A tibble: 12 × 2\n#> mes atraso\n#> <int> <dbl>\n#> 1 1 10.0\n#> 2 2 10.8\n#> 3 3 13.2\n#> 4 4 13.9\n#> 5 5 13.0\n#> 6 6 20.8\n#> # ℹ 6 more rows\n\nVocê pode criar quantas sínteses quiser em uma única chamada à summarize(). 
Você irá aprender várias sumarizações úteis nos próximos capítulos, mas uma que é muito útil é n(), que retorna o número de linhas de cada grupo.\n\nvoos |> \n group_by(mes) |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE), \n n = n()\n )\n#> # A tibble: 12 × 3\n#> mes atraso n\n#> <int> <dbl> <int>\n#> 1 1 10.0 27004\n#> 2 2 10.8 24951\n#> 3 3 13.2 28834\n#> 4 4 13.9 28330\n#> 5 5 13.0 28796\n#> 6 6 20.8 28243\n#> # ℹ 6 more rows\n\nSurpreendentemente, médias e contagens podem te levar longe em ciência de dados!\n\n3.5.3 As funções slice_\n\nExistem cinco funções úteis que lhe permite extrair linhas específicas de dentro de cada grupo:\n\n\ndf |> slice_head(n = 1) pega a primeira linha de cada grupo.\n\ndf |> slice_tail(n = 1) pega a última linha de cada grupo.\n\ndf |> slice_min(x, n = 1) pega a linha com o menor valor da coluna x.\n\ndf |> slice_max(x, n = 1) pega a linha com o maior valor da coluna x.\n\ndf |> slice_sample(n = 1) pega uma linha aleatória.\n\nVocê pode variar n para selecionar mais do que uma linha, ou, em vez de usar n =, você pode usar prop = 0.1 para selecionar (por exemplo) 10% das linhas de cada grupo. Por exemplo, o código a seguir acha os vôos que estão mais atrasados na chegada em cada destino.\n\nvoos |> \n group_by(destino) |> \n slice_max(atraso_chegada, n = 1) |>\n relocate(destino)\n#> # A tibble: 108 × 19\n#> # Groups: destino [105]\n#> destino ano mes dia horario_saida saida_programada atraso_saida\n#> <chr> <int> <int> <int> <int> <int> <dbl>\n#> 1 ABQ 2013 7 22 2145 2007 98\n#> 2 ACK 2013 7 23 1139 800 219\n#> 3 ALB 2013 1 25 123 2000 323\n#> 4 ANC 2013 8 17 1740 1625 75\n#> 5 ATL 2013 7 22 2257 759 898\n#> 6 AUS 2013 7 10 2056 1505 351\n#> # ℹ 102 more rows\n#> # ℹ 12 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\nPerceba que existem 105 destinos, mas obtivemos 108 linhas aqui. O que aconteceu? 
slice_min() e slice_max() mantém os valores empatados, então n = 1 significa “nos dê todas as linhas com o maior valor”. Se você quiser exatamente uma linha por grupo, você pode definir with_ties = FALSE.\nIsso é similar a calcular o atraso máximo com summarize(), mas você obtém toda a linha correspondente (ou linhas, se houver um empate) em vez de apenas uma síntese estatística.\n\n3.5.4 Agrupando por múltiplas variáveis\nVocê pode criar grupos utilizando mais de uma variável. Por exemplo, podemos fazer um grupo para cada data.\n\npor_dia <- voos |> \n group_by(ano, mes, dia)\npor_dia\n#> # A tibble: 336,776 × 19\n#> # Groups: ano, mes, dia [365]\n#> ano mes dia horario_saida saida_programada atraso_saida\n#> <int> <int> <int> <int> <int> <dbl>\n#> 1 2013 1 1 517 515 2\n#> 2 2013 1 1 533 529 4\n#> 3 2013 1 1 542 540 2\n#> 4 2013 1 1 544 545 -1\n#> 5 2013 1 1 554 600 -6\n#> 6 2013 1 1 554 558 -4\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\nQuando você cria um sumário de um tibble agrupado por mais de uma variável, cada sumário remove a camada do último grupo. Na verdade, essa não era uma boa forma de fazer essa função funcionar, mas é difícil mudar agora sem quebrar códigos que já existem. Para tornar óbvio o que está acontecendo, o dplyr exibe uma mensagem que te diz como modificar esse comportamento:\n\nvoos_diarios <- por_dia |> \n summarize(n = n())\n#> `summarise()` has grouped output by 'ano', 'mes'. 
You can override using the\n#> `.groups` argument.\n\nSe você acha que esse comportamente está adequado, você pode requisitá-lo explicitamente para que a mensagem seja suprimida:\n\nvoos_diarios <- por_dia |> \n summarize(\n n = n(), \n .groups = \"drop_last\"\n )\n\nOu então, modifique o comportamento padrão definindo um valor diferente, por exemplo, \"drop\", para descartar todos os agrupamentos ou \"keep\" para manter os mesmos grupos.\n\n3.5.5 Desagrupando\nVocê também pode querer remover agrupamentos de um data frame sem utilizar summarize(). Você pode fazer isso com ungroup().\n\npor_dia |> \n ungroup()\n#> # A tibble: 336,776 × 19\n#> ano mes dia horario_saida saida_programada atraso_saida\n#> <int> <int> <int> <int> <int> <dbl>\n#> 1 2013 1 1 517 515 2\n#> 2 2013 1 1 533 529 4\n#> 3 2013 1 1 542 540 2\n#> 4 2013 1 1 544 545 -1\n#> 5 2013 1 1 554 600 -6\n#> 6 2013 1 1 554 558 -4\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\nAgora, vamos ver o que acontece quando você tenta sumarizar um data frame desagrupado.\n\npor_dia |> \n ungroup() |>\n summarize(\n atraso_medio = mean(atraso_saida, na.rm = TRUE), \n voos = n()\n )\n#> # A tibble: 1 × 2\n#> atraso_medio voos\n#> <dbl> <int>\n#> 1 12.6 336776\n\nVocê obtém uma única linha, pois o dplyr trata todas as linhas de um data frame desagrupado como pertencentes a um único grupo.\n\n3.5.6 .by\n\ndplyr 1.1.0 inclui uma nova e experimental sintaxe para agrupamentos por operação, trata-se do argumento .by. 
group_by() e ungroup() não serão abandonados, mas você pode também utilizar .by para agrupar no âmbito de uma única operação:\n\nvoos |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE), \n n = n(),\n .by = mes\n )\n\nOu, se você quiser agrupar por múltiplas variáveis:\n\nvoos |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE), \n n = n(),\n .by = c(origem, destino)\n )\n\n.by funciona com todos os verbos e tem a vantagem de você não precisar utilizar o argumento .groups para suprimir a mensagem de agrupamento ou ungroup() quando já tiver terminado a operação.\nNós não focamos nessa sintaxe nesse capítulo pois ela era muito nova quando escrevemos o livro. No entanto, quisemos mencioná-la pois achamos que ela tem muito potencial e provavelmente se tonará bastante popular. Você pode ler mais sobre ela em dplyr 1.1.0 blog post.\n\n3.5.7 Exercícios\n\nQual companhia aérea (companhia_aerea) possui a pior média de atrasos? Desafio: você consegue desvendar os efeitos de aeroportos ruins versus companhias aéreas ruins? Por que sim ou por que não? (Dica: experimente usar voos |> group_by(companhia_aerea, destino) |> summarize(n()))\nAche os vôos que estão mais atrasados no momento da decolagem, a partir de cada destino.\nComo os atrasaso variam ao longo do dia. Ilustre sua resposta com um gráfico.\nO que acontece se você passar um n negativo para slice_min() e funções similares?\nExplique o que count() faz em termos dos verbos dplyr que você acabou de aprender. O que o argumento sort faz para a função count()?\n\nSuponha que temos o pequeno data frame a seguir:\n\ndf <- tibble(\n x = 1:5,\n y = c(\"a\", \"b\", \"a\", \"a\", \"b\"),\n z = c(\"K\", \"K\", \"L\", \"L\", \"K\")\n)\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que group_by() faz:\n\ndf |>\n group_by(y)\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que arrange() faz. 
Comente também a respeito da diferença em relação ao group_by() da parte (a):\n\ndf |>\n arrange(y)\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que o seguinte pipeline faz:\n\ndf |>\n group_by(y) |>\n summarize(mean_x = mean(x))\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que o seguinte pipeline faz. Em seguida, comente sobre o que a mensagem diz:\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que o seguinte pipeline faz. Como a saída difere da saída da parte (d):\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x), .groups = \"drop\")\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que cada pipeline faz: Em que os dois pipelines diferem?\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\ndf |>\n group_by(y, z) |>\n mutate(mean_x = mean(x))", + "text": "3.5 Grupos (groups)\nAté aqui você aprendeu sobre funções que trabalham com linhas e colunas. dplyr se torna ainda mais poderoso quando você acrescenta a habilidade de trabalhar com grupos. 
Nessa seção, focaremos nas funções mais importantes: group_by(), summarize(), e a família de funções slice.\n\n3.5.1 group_by()\n\nUtilize group_by() para dividir o seu conjunto de dados em grupos que tenham algum significado para sua análise:\n\nvoos |> \n group_by(mes)\n#> # A tibble: 336,776 × 19\n#> # Groups: mes [12]\n#> ano mes dia horario_saida saida_programada atraso_saida\n#> <int> <int> <int> <int> <int> <dbl>\n#> 1 2013 1 1 517 515 2\n#> 2 2013 1 1 533 529 4\n#> 3 2013 1 1 542 540 2\n#> 4 2013 1 1 544 545 -1\n#> 5 2013 1 1 554 600 -6\n#> 6 2013 1 1 554 558 -4\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\ngroup_by() não modifica seus dados mas, se você observar a saída de perto, você verá que a há uma indicação de que ela está agrupada (“grouped by”) mês (Groups: mes [12]). Isso significa que todas as operações subsequentes irão operar “por mês”. group_by() adiciona essa funcionalidade agrupada (“grouped”) (referenciada como uma classe) ao data frame, que modifica o comportamento dos verbos subsequentes que são aplicados aos dados.\n\n3.5.2 summarize()\n\nA operação agrupada mais importante é a sumarização (“summary”), que caso esteja sendo utilizada para calcular uma única sumarização estatística, reduz o data frame para ter apenas uma única linha para cada grupo.\nNo dplyr, essa operação é feita pela função summarize()5, como mostrado pelo exemplo a seguir, que calcula o atraso médio das decolagens por mês:\n\nvoos |> \n group_by(mes) |> \n summarize(\n atraso_medio = mean(atraso_saida)\n )\n#> # A tibble: 12 × 2\n#> mes atraso_medio\n#> <int> <dbl>\n#> 1 1 NA\n#> 2 2 NA\n#> 3 3 NA\n#> 4 4 NA\n#> 5 5 NA\n#> 6 6 NA\n#> # ℹ 6 more rows\n\nUow! Alguma coisa deu errado e todos os nosso resultados viraram NAs (pronuncia-se “N-A”), que é o símbolo do R para valores ausentes. 
Isso ocorreu pois alguns dos vôos observados possuíam dados ausentes na coluna de atrasos, e por isso, quando calculamos a média incluindo esses valores, obtemos um NA como resultado. Voltaremos a falar de valores ausentes em detalhes em Capítulo 18, mas por enquanto, vamos pedir à função mean() para ignorar todos os valores ausentes definindo o argumento na.rm como TRUE:\n\nvoos |> \n group_by(mes) |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE)\n )\n#> # A tibble: 12 × 2\n#> mes atraso\n#> <int> <dbl>\n#> 1 1 10.0\n#> 2 2 10.8\n#> 3 3 13.2\n#> 4 4 13.9\n#> 5 5 13.0\n#> 6 6 20.8\n#> # ℹ 6 more rows\n\nVocê pode criar quantas sínteses quiser em uma única chamada à summarize(). Você irá aprender várias sumarizações úteis nos próximos capítulos, mas uma que é muito útil é n(), que retorna o número de linhas de cada grupo.\n\nvoos |> \n group_by(mes) |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE), \n n = n()\n )\n#> # A tibble: 12 × 3\n#> mes atraso n\n#> <int> <dbl> <int>\n#> 1 1 10.0 27004\n#> 2 2 10.8 24951\n#> 3 3 13.2 28834\n#> 4 4 13.9 28330\n#> 5 5 13.0 28796\n#> 6 6 20.8 28243\n#> # ℹ 6 more rows\n\nSurpreendentemente, médias e contagens podem te levar longe em ciência de dados!\n\n3.5.3 As funções slice_\n\nExistem cinco funções úteis que lhe permite extrair linhas específicas de dentro de cada grupo:\n\n\ndf |> slice_head(n = 1) pega a primeira linha de cada grupo.\n\ndf |> slice_tail(n = 1) pega a última linha de cada grupo.\n\ndf |> slice_min(x, n = 1) pega a linha com o menor valor da coluna x.\n\ndf |> slice_max(x, n = 1) pega a linha com o maior valor da coluna x.\n\ndf |> slice_sample(n = 1) pega uma linha aleatória.\n\nVocê pode variar n para selecionar mais do que uma linha, ou, em vez de usar n =, você pode usar prop = 0.1 para selecionar (por exemplo) 10% das linhas de cada grupo. 
Por exemplo, o código a seguir acha os vôos que estão mais atrasados na chegada em cada destino.\n\nvoos |> \n group_by(destino) |> \n slice_max(atraso_chegada, n = 1) |>\n relocate(destino)\n#> # A tibble: 108 × 19\n#> # Groups: destino [105]\n#> destino ano mes dia horario_saida saida_programada atraso_saida\n#> <chr> <int> <int> <int> <int> <int> <dbl>\n#> 1 ABQ 2013 7 22 2145 2007 98\n#> 2 ACK 2013 7 23 1139 800 219\n#> 3 ALB 2013 1 25 123 2000 323\n#> 4 ANC 2013 8 17 1740 1625 75\n#> 5 ATL 2013 7 22 2257 759 898\n#> 6 AUS 2013 7 10 2056 1505 351\n#> # ℹ 102 more rows\n#> # ℹ 12 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\nPerceba que existem 105 destinos, mas obtivemos 108 linhas aqui. O que aconteceu? slice_min() e slice_max() mantém os valores empatados, então n = 1 significa “nos dê todas as linhas com o maior valor”. Se você quiser exatamente uma linha por grupo, você pode definir with_ties = FALSE.\nIsso é similar a calcular o atraso máximo com summarize(), mas você obtém toda a linha correspondente (ou linhas, se houver um empate) em vez de apenas uma síntese estatística.\n\n3.5.4 Agrupando por múltiplas variáveis\nVocê pode criar grupos utilizando mais de uma variável. Por exemplo, podemos fazer um grupo para cada data.\n\npor_dia <- voos |> \n group_by(ano, mes, dia)\npor_dia\n#> # A tibble: 336,776 × 19\n#> # Groups: ano, mes, dia [365]\n#> ano mes dia horario_saida saida_programada atraso_saida\n#> <int> <int> <int> <int> <int> <dbl>\n#> 1 2013 1 1 517 515 2\n#> 2 2013 1 1 533 529 4\n#> 3 2013 1 1 542 540 2\n#> 4 2013 1 1 544 545 -1\n#> 5 2013 1 1 554 600 -6\n#> 6 2013 1 1 554 558 -4\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\nQuando você cria um sumário de um tibble agrupado por mais de uma variável, cada sumário remove a camada do último grupo. 
Na verdade, essa não era uma boa forma de fazer essa função funcionar, mas é difícil mudar agora sem quebrar códigos que já existem. Para tornar óbvio o que está acontecendo, o dplyr exibe uma mensagem que te diz como modificar esse comportamento:\n\nvoos_diarios <- por_dia |> \n summarize(n = n())\n#> `summarise()` has grouped output by 'ano', 'mes'. You can override using the\n#> `.groups` argument.\n\nSe você acha que esse comportamente está adequado, você pode requisitá-lo explicitamente para que a mensagem seja suprimida:\n\nvoos_diarios <- por_dia |> \n summarize(\n n = n(), \n .groups = \"drop_last\"\n )\n\nOu então, modifique o comportamento padrão definindo um valor diferente, por exemplo, \"drop\", para descartar todos os agrupamentos ou \"keep\" para manter os mesmos grupos.\n\n3.5.5 Desagrupando\nVocê também pode querer remover agrupamentos de um data frame sem utilizar summarize(). Você pode fazer isso com ungroup().\n\npor_dia |> \n ungroup()\n#> # A tibble: 336,776 × 19\n#> ano mes dia horario_saida saida_programada atraso_saida\n#> <int> <int> <int> <int> <int> <dbl>\n#> 1 2013 1 1 517 515 2\n#> 2 2013 1 1 533 529 4\n#> 3 2013 1 1 542 540 2\n#> 4 2013 1 1 544 545 -1\n#> 5 2013 1 1 554 600 -6\n#> 6 2013 1 1 554 558 -4\n#> # ℹ 336,770 more rows\n#> # ℹ 13 more variables: horario_chegada <int>, chegada_prevista <int>, …\n\nAgora, vamos ver o que acontece quando você tenta sumarizar um data frame desagrupado.\n\npor_dia |> \n ungroup() |>\n summarize(\n atraso_medio = mean(atraso_saida, na.rm = TRUE), \n voos = n()\n )\n#> # A tibble: 1 × 2\n#> atraso_medio voos\n#> <dbl> <int>\n#> 1 12.6 336776\n\nVocê obtém uma única linha, pois o dplyr trata todas as linhas de um data frame desagrupado como pertencentes a um único grupo.\n\n3.5.6 .by\n\ndplyr 1.1.0 inclui uma nova e experimental sintaxe para agrupamentos por operação, trata-se do argumento .by. 
group_by() e ungroup() não serão abandonados, mas você pode também utilizar .by para agrupar no âmbito de uma única operação:\n\nvoos |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE), \n n = n(),\n .by = mes\n )\n\nOu, se você quiser agrupar por múltiplas variáveis:\n\nvoos |> \n summarize(\n atraso = mean(atraso_saida, na.rm = TRUE), \n n = n(),\n .by = c(origem, destino)\n )\n\n.by funciona com todos os verbos e tem a vantagem de você não precisar utilizar o argumento .groups para suprimir a mensagem de agrupamento ou ungroup() quando já tiver terminado a operação.\nNós não focamos nessa sintaxe nesse capítulo pois ela era muito nova quando escrevemos o livro. No entanto, quisemos mencioná-la pois achamos que ela tem muito potencial e provavelmente se tonará bastante popular. Você pode ler mais sobre ela em dplyr 1.1.0 blog post.\n\n3.5.7 Exercícios\n\nQual companhia aérea (companhia_aerea) possui a pior média de atrasos? Desafio: você consegue desvendar os efeitos de aeroportos ruins versus companhias aéreas ruins? Por que sim ou por que não? (Dica: experimente usar voos |> group_by(companhia_aerea, destino) |> summarize(n()))\nAche os vôos que estão mais atrasados no momento da decolagem, a partir de cada destino.\nComo os atrasos variam ao longo do dia. Ilustre sua resposta com um gráfico.\nO que acontece se você passar um n negativo para slice_min() e funções similares?\nExplique o que count() faz em termos dos verbos dplyr que você acabou de aprender. O que o argumento sort faz para a função count()?\n\nSuponha que temos o pequeno data frame a seguir:\n\ndf <- tibble(\n x = 1:5,\n y = c(\"a\", \"b\", \"a\", \"a\", \"b\"),\n z = c(\"K\", \"K\", \"L\", \"L\", \"K\")\n)\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que group_by() faz:\n\ndf |>\n group_by(y)\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que arrange() faz. 
Comente também a respeito da diferença em relação ao group_by() da parte (a):\n\ndf |>\n arrange(y)\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que o seguinte pipeline faz:\n\ndf |>\n group_by(y) |>\n summarize(mean_x = mean(x))\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que o seguinte pipeline faz. Em seguida, comente sobre o que a mensagem diz:\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que o seguinte pipeline faz. Como a saída difere da saída da parte (d):\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x), .groups = \"drop\")\n\n\n\nEscreva como você acha que será a saída, e em seguida confira se acertou e descreva o que cada pipeline faz: Em que os dois pipelines diferem?\n\ndf |>\n group_by(y, z) |>\n summarize(mean_x = mean(x))\n\ndf |>\n group_by(y, z) |>\n mutate(mean_x = mean(x))", "crumbs": [ "✅ Visão geral", "3  ✅ Transformação de dados" @@ -1629,7 +1629,7 @@ "href": "datetimes.html#sec-creating-datetimes", "title": "17  Dates and times", "section": "\n17.2 Creating date/times", - "text": "17.2 Creating date/times\nThere are three types of date/time data that refer to an instant in time:\n\nA date. Tibbles print this as <date>.\nA time within a day. Tibbles print this as <time>.\nA date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but doesn’t exactly trip off the tongue.\n\nIn this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.\nYou should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. 
Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.\nTo get the current date or date-time you can use today() or now():\n\ntoday()\n#> [1] \"2024-07-29\"\nnow()\n#> [1] \"2024-07-29 10:53:09 -03\"\n\nOtherwise, the following sections describe the four ways you’re likely to create a date/time:\n\nWhile reading a file with readr.\nFrom a string.\nFrom individual date-time components.\nFrom an existing date/time object.\n\n\n17.2.1 During import\nIf your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:\n\ncsv <- \"\n date,datetime\n 2022-01-02,2022-01-02 05:12\n\"\nread_csv(csv)\n#> # A tibble: 1 × 2\n#> date datetime \n#> <date> <dttm> \n#> 1 2022-01-02 2022-01-02 05:12:00\n\nIf you haven’t heard of ISO8601 before, it’s an international standard2 for writing dates where the components of a date are organized from biggest to smallest separated by -. For example, in ISO8601 May 3 2022 is 2022-05-03. ISO8601 dates can also include times, where hour, minute, and second are separated by :, and the date and time components are separated by either a T or a space. For example, you could write 4:26pm on May 3 2022 as either 2022-05-03 16:26 or 2022-05-03T16:26.\nFor other date-time formats, you’ll need to use col_types plus col_date() or col_datetime() along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a % followed by a single character. For example, %Y-%m-%d specifies a date that’s a year, -, month (as number) -, day. 
Table Tabela 17.1 lists all the options.\n\n\nTabela 17.1: All date formats understood by readr\n\n\n\nType\nCode\nMeaning\nExample\n\n\n\nYear\n%Y\n4 digit year\n2021\n\n\n\n%y\n2 digit year\n21\n\n\nMonth\n%m\nNumber\n2\n\n\n\n%b\nAbbreviated name\nFeb\n\n\n\n%B\nFull name\nFebruary\n\n\nDay\n%d\nOne or two digits\n2\n\n\n\n%e\nTwo digits\n02\n\n\nTime\n%H\n24-hour hour\n13\n\n\n\n%I\n12-hour hour\n1\n\n\n\n%p\nAM/PM\npm\n\n\n\n%M\nMinutes\n35\n\n\n\n%S\nSeconds\n45\n\n\n\n%OS\nSeconds with decimal component\n45.35\n\n\n\n%Z\nTime zone name\nAmerica/Chicago\n\n\n\n%z\nOffset from UTC\n+0800\n\n\nOther\n%.\nSkip one non-digit\n:\n\n\n\n%*\nSkip any number of non-digits\n\n\n\n\n\n\n\nAnd this code shows a few options applied to a very ambiguous date:\n\ncsv <- \"\n date\n 01/02/15\n\"\n\nread_csv(csv, col_types = cols(date = col_date(\"%m/%d/%y\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2015-01-02\n\nread_csv(csv, col_types = cols(date = col_date(\"%d/%m/%y\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2015-02-01\n\nread_csv(csv, col_types = cols(date = col_date(\"%y/%m/%d\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2001-02-15\n\nNote that no matter how you specify the date format, it’s always displayed the same way once you get it into R.\nIf you’re using %b or %B and working with non-English dates, you’ll also need to provide a locale(). See the list of built-in languages in date_names_langs(), or create your own with date_names(),\n\n17.2.2 From strings\nThe date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers which attempt to automatically determine the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. 
For example:\n\nymd(\"2017-01-31\")\n#> [1] \"2017-01-31\"\nmdy(\"January 31st, 2017\")\n#> [1] \"2017-01-31\"\ndmy(\"31-Jan-2017\")\n#> [1] \"2017-01-31\"\n\nymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:\n\nymd_hms(\"2017-01-31 20:11:59\")\n#> [1] \"2017-01-31 20:11:59 UTC\"\nmdy_hm(\"01/31/2017 08:01\")\n#> [1] \"2017-01-31 08:01:00 UTC\"\n\nYou can also force the creation of a date-time from a date by supplying a timezone:\n\nymd(\"2017-01-31\", tz = \"UTC\")\n#> [1] \"2017-01-31 UTC\"\n\nHere I use the UTC3 timezone which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude4 . It doesn’t use daylight saving time, making it a bit easier to compute with .\n\n17.2.3 From individual components\nInstead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:\n\nflights |> \n select(year, month, day, hour, minute)\n#> # A tibble: 336,776 × 5\n#> year month day hour minute\n#> <int> <int> <int> <dbl> <dbl>\n#> 1 2013 1 1 5 15\n#> 2 2013 1 1 5 29\n#> 3 2013 1 1 5 40\n#> 4 2013 1 1 5 45\n#> 5 2013 1 1 6 0\n#> 6 2013 1 1 5 58\n#> # ℹ 336,770 more rows\n\nTo create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:\n\nflights |> \n select(year, month, day, hour, minute) |> \n mutate(departure = make_datetime(year, month, day, hour, minute))\n#> # A tibble: 336,776 × 6\n#> year month day hour minute departure \n#> <int> <int> <int> <dbl> <dbl> <dttm> \n#> 1 2013 1 1 5 15 2013-01-01 05:15:00\n#> 2 2013 1 1 5 29 2013-01-01 05:29:00\n#> 3 2013 1 1 5 40 2013-01-01 05:40:00\n#> 4 2013 1 1 5 45 2013-01-01 05:45:00\n#> 5 2013 1 1 6 0 2013-01-01 06:00:00\n#> 6 2013 1 1 5 58 2013-01-01 05:58:00\n#> # ℹ 336,770 more rows\n\nLet’s do the same thing for each of the four time columns in flights. 
The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.\n\nmake_datetime_100 <- function(year, month, day, time) {\n make_datetime(year, month, day, time %/% 100, time %% 100)\n}\n\nflights_dt <- flights |> \n filter(!is.na(dep_time), !is.na(arr_time)) |> \n mutate(\n dep_time = make_datetime_100(year, month, day, dep_time),\n arr_time = make_datetime_100(year, month, day, arr_time),\n sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),\n sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)\n ) |> \n select(origin, dest, ends_with(\"delay\"), ends_with(\"time\"))\n\nflights_dt\n#> # A tibble: 328,063 × 9\n#> origin dest dep_delay arr_delay dep_time sched_dep_time \n#> <chr> <chr> <dbl> <dbl> <dttm> <dttm> \n#> 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00\n#> 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00\n#> 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00\n#> 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00\n#> 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00\n#> 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00\n#> # ℹ 328,057 more rows\n#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …\n\nWith this data, we can visualize the distribution of departure times across the year:\n\nflights_dt |> \n ggplot(aes(x = dep_time)) + \n geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day\n\n\n\n\n\n\n\nOr within a single day:\n\nflights_dt |> \n filter(dep_time < ymd(20130102)) |> \n ggplot(aes(x = dep_time)) + \n geom_freqpoly(binwidth = 600) # 600 s = 10 minutes\n\n\n\n\n\n\n\nNote that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. 
For dates, 1 means 1 day.\n\n17.2.4 From other types\nYou may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():\n\nas_datetime(today())\n#> [1] \"2024-07-29 UTC\"\nas_date(now())\n#> [1] \"2024-07-29\"\n\nSometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().\n\nas_datetime(60 * 60 * 10)\n#> [1] \"1970-01-01 10:00:00 UTC\"\nas_date(365 * 10 + 2)\n#> [1] \"1980-01-01\"\n\n\n17.2.5 Exercises\n\n\nWhat happens if you parse a string that contains invalid dates?\n\nymd(c(\"2010-10-10\", \"bananas\"))\n\n\nWhat does the tzone argument to today() do? Why is it important?\n\nFor each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.\n\nd1 <- \"January 1, 2010\"\nd2 <- \"2015-Mar-07\"\nd3 <- \"06-Jun-2017\"\nd4 <- c(\"August 19 (2015)\", \"July 1 (2015)\")\nd5 <- \"12/30/14\" # Dec 30, 2014\nt1 <- \"1705\"\nt2 <- \"11:15:10.12 PM\"", + "text": "17.2 Creating date/times\nThere are three types of date/time data that refer to an instant in time:\n\nA date. Tibbles print this as <date>.\nA time within a day. Tibbles print this as <time>.\nA date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but doesn’t exactly trip off the tongue.\n\nIn this chapter we are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.\nYou should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. 
Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.\nTo get the current date or date-time you can use today() or now():\n\ntoday()\n#> [1] \"2024-07-30\"\nnow()\n#> [1] \"2024-07-30 09:30:14 -03\"\n\nOtherwise, the following sections describe the four ways you’re likely to create a date/time:\n\nWhile reading a file with readr.\nFrom a string.\nFrom individual date-time components.\nFrom an existing date/time object.\n\n\n17.2.1 During import\nIf your CSV contains an ISO8601 date or date-time, you don’t need to do anything; readr will automatically recognize it:\n\ncsv <- \"\n date,datetime\n 2022-01-02,2022-01-02 05:12\n\"\nread_csv(csv)\n#> # A tibble: 1 × 2\n#> date datetime \n#> <date> <dttm> \n#> 1 2022-01-02 2022-01-02 05:12:00\n\nIf you haven’t heard of ISO8601 before, it’s an international standard2 for writing dates where the components of a date are organized from biggest to smallest separated by -. For example, in ISO8601 May 3 2022 is 2022-05-03. ISO8601 dates can also include times, where hour, minute, and second are separated by :, and the date and time components are separated by either a T or a space. For example, you could write 4:26pm on May 3 2022 as either 2022-05-03 16:26 or 2022-05-03T16:26.\nFor other date-time formats, you’ll need to use col_types plus col_date() or col_datetime() along with a date-time format. The date-time format used by readr is a standard used across many programming languages, describing a date component with a % followed by a single character. For example, %Y-%m-%d specifies a date that’s a year, -, month (as number) -, day. 
Tabela 17.1 lists all the options.\n\n\nTabela 17.1: All date formats understood by readr\n\n\n\nType\nCode\nMeaning\nExample\n\n\n\nYear\n%Y\n4 digit year\n2021\n\n\n\n%y\n2 digit year\n21\n\n\nMonth\n%m\nNumber\n2\n\n\n\n%b\nAbbreviated name\nFeb\n\n\n\n%B\nFull name\nFebruary\n\n\nDay\n%d\nOne or two digits\n2\n\n\n\n%e\nTwo digits\n02\n\n\nTime\n%H\n24-hour hour\n13\n\n\n\n%I\n12-hour hour\n1\n\n\n\n%p\nAM/PM\npm\n\n\n\n%M\nMinutes\n35\n\n\n\n%S\nSeconds\n45\n\n\n\n%OS\nSeconds with decimal component\n45.35\n\n\n\n%Z\nTime zone name\nAmerica/Chicago\n\n\n\n%z\nOffset from UTC\n+0800\n\n\nOther\n%.\nSkip one non-digit\n:\n\n\n\n%*\nSkip any number of non-digits\n\n\n\n\n\n\n\nAnd this code shows a few options applied to a very ambiguous date:\n\ncsv <- \"\n date\n 01/02/15\n\"\n\nread_csv(csv, col_types = cols(date = col_date(\"%m/%d/%y\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2015-01-02\n\nread_csv(csv, col_types = cols(date = col_date(\"%d/%m/%y\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2015-02-01\n\nread_csv(csv, col_types = cols(date = col_date(\"%y/%m/%d\")))\n#> # A tibble: 1 × 1\n#> date \n#> <date> \n#> 1 2001-02-15\n\nNote that no matter how you specify the date format, it’s always displayed the same way once you get it into R.\nIf you’re using %b or %B and working with non-English dates, you’ll also need to provide a locale(). See the list of built-in languages in date_names_langs(), or create your own with date_names().\n\n17.2.2 From strings\nThe date-time specification language is powerful, but requires careful analysis of the date format. An alternative approach is to use lubridate’s helpers which attempt to automatically determine the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date. 
For example:\n\nymd(\"2017-01-31\")\n#> [1] \"2017-01-31\"\nmdy(\"January 31st, 2017\")\n#> [1] \"2017-01-31\"\ndmy(\"31-Jan-2017\")\n#> [1] \"2017-01-31\"\n\nymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:\n\nymd_hms(\"2017-01-31 20:11:59\")\n#> [1] \"2017-01-31 20:11:59 UTC\"\nmdy_hm(\"01/31/2017 08:01\")\n#> [1] \"2017-01-31 08:01:00 UTC\"\n\nYou can also force the creation of a date-time from a date by supplying a timezone:\n\nymd(\"2017-01-31\", tz = \"UTC\")\n#> [1] \"2017-01-31 UTC\"\n\nHere I use the UTC3 timezone which you might also know as GMT, or Greenwich Mean Time, the time at 0° longitude4. It doesn’t use daylight saving time, making it a bit easier to compute with.\n\n17.2.3 From individual components\nInstead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:\n\nflights |> \n select(year, month, day, hour, minute)\n#> # A tibble: 336,776 × 5\n#> year month day hour minute\n#> <int> <int> <int> <dbl> <dbl>\n#> 1 2013 1 1 5 15\n#> 2 2013 1 1 5 29\n#> 3 2013 1 1 5 40\n#> 4 2013 1 1 5 45\n#> 5 2013 1 1 6 0\n#> 6 2013 1 1 5 58\n#> # ℹ 336,770 more rows\n\nTo create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:\n\nflights |> \n select(year, month, day, hour, minute) |> \n mutate(departure = make_datetime(year, month, day, hour, minute))\n#> # A tibble: 336,776 × 6\n#> year month day hour minute departure \n#> <int> <int> <int> <dbl> <dbl> <dttm> \n#> 1 2013 1 1 5 15 2013-01-01 05:15:00\n#> 2 2013 1 1 5 29 2013-01-01 05:29:00\n#> 3 2013 1 1 5 40 2013-01-01 05:40:00\n#> 4 2013 1 1 5 45 2013-01-01 05:45:00\n#> 5 2013 1 1 6 0 2013-01-01 06:00:00\n#> 6 2013 1 1 5 58 2013-01-01 05:58:00\n#> # ℹ 336,770 more rows\n\nLet’s do the same thing for each of the four time columns in flights. 
The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components. Once we’ve created the date-time variables, we focus in on the variables we’ll explore in the rest of the chapter.\n\nmake_datetime_100 <- function(year, month, day, time) {\n make_datetime(year, month, day, time %/% 100, time %% 100)\n}\n\nflights_dt <- flights |> \n filter(!is.na(dep_time), !is.na(arr_time)) |> \n mutate(\n dep_time = make_datetime_100(year, month, day, dep_time),\n arr_time = make_datetime_100(year, month, day, arr_time),\n sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),\n sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)\n ) |> \n select(origin, dest, ends_with(\"delay\"), ends_with(\"time\"))\n\nflights_dt\n#> # A tibble: 328,063 × 9\n#> origin dest dep_delay arr_delay dep_time sched_dep_time \n#> <chr> <chr> <dbl> <dbl> <dttm> <dttm> \n#> 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00\n#> 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00\n#> 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00\n#> 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00\n#> 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00\n#> 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00\n#> # ℹ 328,057 more rows\n#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …\n\nWith this data, we can visualize the distribution of departure times across the year:\n\nflights_dt |> \n ggplot(aes(x = dep_time)) + \n geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day\n\n\n\n\n\n\n\nOr within a single day:\n\nflights_dt |> \n filter(dep_time < ymd(20130102)) |> \n ggplot(aes(x = dep_time)) + \n geom_freqpoly(binwidth = 600) # 600 s = 10 minutes\n\n\n\n\n\n\n\nNote that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. 
For dates, 1 means 1 day.\n\n17.2.4 From other types\nYou may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date():\n\nas_datetime(today())\n#> [1] \"2024-07-30 UTC\"\nas_date(now())\n#> [1] \"2024-07-30\"\n\nSometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().\n\nas_datetime(60 * 60 * 10)\n#> [1] \"1970-01-01 10:00:00 UTC\"\nas_date(365 * 10 + 2)\n#> [1] \"1980-01-01\"\n\n\n17.2.5 Exercises\n\n\nWhat happens if you parse a string that contains invalid dates?\n\nymd(c(\"2010-10-10\", \"bananas\"))\n\n\nWhat does the tzone argument to today() do? Why is it important?\n\nFor each of the following date-times, show how you’d parse it using a readr column specification and a lubridate function.\n\nd1 <- \"January 1, 2010\"\nd2 <- \"2015-Mar-07\"\nd3 <- \"06-Jun-2017\"\nd4 <- c(\"August 19 (2015)\", \"July 1 (2015)\")\nd5 <- \"12/30/14\" # Dec 30, 2014\nt1 <- \"1705\"\nt2 <- \"11:15:10.12 PM\"", "crumbs": [ "✅ Transformar", "17  Dates and times" @@ -1651,7 +1651,7 @@ "href": "datetimes.html#time-spans", "title": "17  Dates and times", "section": "\n17.4 Time spans", - "text": "17.4 Time spans\nNext you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:\n\n\nDurations, which represent an exact number of seconds.\n\nPeriods, which represent human units like weeks and months.\n\nIntervals, which represent a starting and ending point.\n\nHow do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. 
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.\n\n17.4.1 Durations\nIn R, when you subtract two dates, you get a difftime object:\n\n# How old is Hadley?\nh_age <- today() - ymd(\"1979-10-14\")\nh_age\n#> Time difference of 16360 days\n\nA difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.\n\nas.duration(h_age)\n#> [1] \"1413504000s (~44.79 years)\"\n\nDurations come with a bunch of convenient constructors:\n\ndseconds(15)\n#> [1] \"15s\"\ndminutes(10)\n#> [1] \"600s (~10 minutes)\"\ndhours(c(12, 24))\n#> [1] \"43200s (~12 hours)\" \"86400s (~1 days)\"\nddays(0:5)\n#> [1] \"0s\" \"86400s (~1 days)\" \"172800s (~2 days)\"\n#> [4] \"259200s (~3 days)\" \"345600s (~4 days)\" \"432000s (~5 days)\"\ndweeks(3)\n#> [1] \"1814400s (~3 weeks)\"\ndyears(1)\n#> [1] \"31557600s (~1 years)\"\n\nDurations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. 
There’s no way to convert a month to a duration, because there’s just too much variation.\nYou can add and multiply durations:\n\n2 * dyears(1)\n#> [1] \"63115200s (~2 years)\"\ndyears(1) + dweeks(12) + dhours(15)\n#> [1] \"38869200s (~1.23 years)\"\n\nYou can add and subtract durations to and from days:\n\ntomorrow <- today() + ddays(1)\nlast_year <- today() - dyears(1)\n\nHowever, because durations represent an exact number of seconds, sometimes you might get an unexpected result:\n\none_am <- ymd_hms(\"2026-03-08 01:00:00\", tz = \"America/New_York\")\n\none_am\n#> [1] \"2026-03-08 01:00:00 EST\"\none_am + ddays(1)\n#> [1] \"2026-03-09 02:00:00 EDT\"\n\nWhy is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full days worth of seconds we end up with a different time.\n\n17.4.2 Periods\nTo solve this problem, lubridate provides periods. Periods are time spans but don’t have a fixed length in seconds, instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:\n\none_am\n#> [1] \"2026-03-08 01:00:00 EST\"\none_am + days(1)\n#> [1] \"2026-03-09 01:00:00 EDT\"\n\nLike durations, periods can be created with a number of friendly constructor functions.\n\nhours(c(12, 24))\n#> [1] \"12H 0M 0S\" \"24H 0M 0S\"\ndays(7)\n#> [1] \"7d 0H 0M 0S\"\nmonths(1:6)\n#> [1] \"1m 0d 0H 0M 0S\" \"2m 0d 0H 0M 0S\" \"3m 0d 0H 0M 0S\" \"4m 0d 0H 0M 0S\"\n#> [5] \"5m 0d 0H 0M 0S\" \"6m 0d 0H 0M 0S\"\n\nYou can add and multiply periods:\n\n10 * (months(6) + days(1))\n#> [1] \"60m 10d 0H 0M 0S\"\ndays(50) + hours(25) + minutes(2)\n#> [1] \"50d 25H 2M 0S\"\n\nAnd of course, add them to dates. 
Compared to durations, periods are more likely to do what you expect:\n\n# A leap year\nymd(\"2024-01-01\") + dyears(1)\n#> [1] \"2024-12-31 06:00:00 UTC\"\nymd(\"2024-01-01\") + years(1)\n#> [1] \"2025-01-01\"\n\n# Daylight saving time\none_am + ddays(1)\n#> [1] \"2026-03-09 02:00:00 EDT\"\none_am + days(1)\n#> [1] \"2026-03-09 01:00:00 EDT\"\n\nLet’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.\n\nflights_dt |> \n filter(arr_time < dep_time) \n#> # A tibble: 10,633 × 9\n#> origin dest dep_delay arr_delay dep_time sched_dep_time \n#> <chr> <chr> <dbl> <dbl> <dttm> <dttm> \n#> 1 EWR BQN 9 -4 2013-01-01 19:29:00 2013-01-01 19:20:00\n#> 2 JFK DFW 59 NA 2013-01-01 19:39:00 2013-01-01 18:40:00\n#> 3 EWR TPA -2 9 2013-01-01 20:58:00 2013-01-01 21:00:00\n#> 4 EWR SJU -6 -12 2013-01-01 21:02:00 2013-01-01 21:08:00\n#> 5 EWR SFO 11 -14 2013-01-01 21:08:00 2013-01-01 20:57:00\n#> 6 LGA FLL -10 -2 2013-01-01 21:20:00 2013-01-01 21:30:00\n#> # ℹ 10,627 more rows\n#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …\n\nThese are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.\n\nflights_dt <- flights_dt |> \n mutate(\n overnight = arr_time < dep_time,\n arr_time = arr_time + days(overnight),\n sched_arr_time = sched_arr_time + days(overnight)\n )\n\nNow all of our flights obey the laws of physics.\n\nflights_dt |> \n filter(arr_time < dep_time) \n#> # A tibble: 0 × 10\n#> # ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,\n#> # arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>, …\n\n\n17.4.3 Intervals\nWhat does dyears(1) / ddays(365) return? 
It’s not quite one, because dyears() is defined as the number of seconds per average year, which is 365.25 days.\nWhat does years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:\n\nyears(1) / days(1)\n#> [1] 365.25\n\nIf you want a more accurate measurement, you’ll have to use an interval. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.\nYou can create an interval by writing start %--% end:\n\ny2023 <- ymd(\"2023-01-01\") %--% ymd(\"2024-01-01\")\ny2024 <- ymd(\"2024-01-01\") %--% ymd(\"2025-01-01\")\n\ny2023\n#> [1] 2023-01-01 UTC--2024-01-01 UTC\ny2024\n#> [1] 2024-01-01 UTC--2025-01-01 UTC\n\nYou could then divide it by days() to find out how many days fit in the year:\n\ny2023 / days(1)\n#> [1] 365\ny2024 / days(1)\n#> [1] 366\n\n\n17.4.4 Exercises\n\nExplain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?\nCreate a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.\nWrite a function that given your birthday (as a date), returns how old you are in years.\nWhy can’t (today() %--% (today() + years(1))) / months(1) work?", + "text": "17.4 Time spans\nNext you’ll learn about how arithmetic with dates works, including subtraction, addition, and division. Along the way, you’ll learn about three important classes that represent time spans:\n\n\nDurations, which represent an exact number of seconds.\n\nPeriods, which represent human units like weeks and months.\n\nIntervals, which represent a starting and ending point.\n\nHow do you pick between duration, periods, and intervals? As always, pick the simplest data structure that solves your problem. 
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.\n\n17.4.1 Durations\nIn R, when you subtract two dates, you get a difftime object:\n\n# How old is Hadley?\nh_age <- today() - ymd(\"1979-10-14\")\nh_age\n#> Time difference of 16361 days\n\nA difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds: the duration.\n\nas.duration(h_age)\n#> [1] \"1413590400s (~44.79 years)\"\n\nDurations come with a bunch of convenient constructors:\n\ndseconds(15)\n#> [1] \"15s\"\ndminutes(10)\n#> [1] \"600s (~10 minutes)\"\ndhours(c(12, 24))\n#> [1] \"43200s (~12 hours)\" \"86400s (~1 days)\"\nddays(0:5)\n#> [1] \"0s\" \"86400s (~1 days)\" \"172800s (~2 days)\"\n#> [4] \"259200s (~3 days)\" \"345600s (~4 days)\" \"432000s (~5 days)\"\ndweeks(3)\n#> [1] \"1814400s (~3 weeks)\"\ndyears(1)\n#> [1] \"31557600s (~1 years)\"\n\nDurations always record the time span in seconds. Larger units are created by converting minutes, hours, days, weeks, and years to seconds: 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 7 days in a week. Larger time units are more problematic. A year uses the “average” number of days in a year, i.e. 365.25. 
There’s no way to convert a month to a duration, because there’s just too much variation.\nYou can add and multiply durations:\n\n2 * dyears(1)\n#> [1] \"63115200s (~2 years)\"\ndyears(1) + dweeks(12) + dhours(15)\n#> [1] \"38869200s (~1.23 years)\"\n\nYou can add and subtract durations to and from days:\n\ntomorrow <- today() + ddays(1)\nlast_year <- today() - dyears(1)\n\nHowever, because durations represent an exact number of seconds, sometimes you might get an unexpected result:\n\none_am <- ymd_hms(\"2026-03-08 01:00:00\", tz = \"America/New_York\")\n\none_am\n#> [1] \"2026-03-08 01:00:00 EST\"\none_am + ddays(1)\n#> [1] \"2026-03-09 02:00:00 EDT\"\n\nWhy is one day after 1am March 8, 2am March 9? If you look carefully at the date you might also notice that the time zones have changed. March 8 only has 23 hours because it’s when DST starts, so if we add a full day’s worth of seconds we end up with a different time.\n\n17.4.2 Periods\nTo solve this problem, lubridate provides periods. Periods are time spans but don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:\n\none_am\n#> [1] \"2026-03-08 01:00:00 EST\"\none_am + days(1)\n#> [1] \"2026-03-09 01:00:00 EDT\"\n\nLike durations, periods can be created with a number of friendly constructor functions.\n\nhours(c(12, 24))\n#> [1] \"12H 0M 0S\" \"24H 0M 0S\"\ndays(7)\n#> [1] \"7d 0H 0M 0S\"\nmonths(1:6)\n#> [1] \"1m 0d 0H 0M 0S\" \"2m 0d 0H 0M 0S\" \"3m 0d 0H 0M 0S\" \"4m 0d 0H 0M 0S\"\n#> [5] \"5m 0d 0H 0M 0S\" \"6m 0d 0H 0M 0S\"\n\nYou can add and multiply periods:\n\n10 * (months(6) + days(1))\n#> [1] \"60m 10d 0H 0M 0S\"\ndays(50) + hours(25) + minutes(2)\n#> [1] \"50d 25H 2M 0S\"\n\nAnd of course, add them to dates. 
Compared to durations, periods are more likely to do what you expect:\n\n# A leap year\nymd(\"2024-01-01\") + dyears(1)\n#> [1] \"2024-12-31 06:00:00 UTC\"\nymd(\"2024-01-01\") + years(1)\n#> [1] \"2025-01-01\"\n\n# Daylight saving time\none_am + ddays(1)\n#> [1] \"2026-03-09 02:00:00 EDT\"\none_am + days(1)\n#> [1] \"2026-03-09 01:00:00 EDT\"\n\nLet’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.\n\nflights_dt |> \n filter(arr_time < dep_time) \n#> # A tibble: 10,633 × 9\n#> origin dest dep_delay arr_delay dep_time sched_dep_time \n#> <chr> <chr> <dbl> <dbl> <dttm> <dttm> \n#> 1 EWR BQN 9 -4 2013-01-01 19:29:00 2013-01-01 19:20:00\n#> 2 JFK DFW 59 NA 2013-01-01 19:39:00 2013-01-01 18:40:00\n#> 3 EWR TPA -2 9 2013-01-01 20:58:00 2013-01-01 21:00:00\n#> 4 EWR SJU -6 -12 2013-01-01 21:02:00 2013-01-01 21:08:00\n#> 5 EWR SFO 11 -14 2013-01-01 21:08:00 2013-01-01 20:57:00\n#> 6 LGA FLL -10 -2 2013-01-01 21:20:00 2013-01-01 21:30:00\n#> # ℹ 10,627 more rows\n#> # ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, …\n\nThese are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.\n\nflights_dt <- flights_dt |> \n mutate(\n overnight = arr_time < dep_time,\n arr_time = arr_time + days(overnight),\n sched_arr_time = sched_arr_time + days(overnight)\n )\n\nNow all of our flights obey the laws of physics.\n\nflights_dt |> \n filter(arr_time < dep_time) \n#> # A tibble: 0 × 10\n#> # ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,\n#> # arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>, …\n\n\n17.4.3 Intervals\nWhat does dyears(1) / ddays(365) return? 
It’s not quite one, because dyears() is defined as the number of seconds per average year, which is 365.25 days.\nWhat does years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:\n\nyears(1) / days(1)\n#> [1] 365.25\n\nIf you want a more accurate measurement, you’ll have to use an interval. An interval is a pair of starting and ending date times, or you can think of it as a duration with a starting point.\nYou can create an interval by writing start %--% end:\n\ny2023 <- ymd(\"2023-01-01\") %--% ymd(\"2024-01-01\")\ny2024 <- ymd(\"2024-01-01\") %--% ymd(\"2025-01-01\")\n\ny2023\n#> [1] 2023-01-01 UTC--2024-01-01 UTC\ny2024\n#> [1] 2024-01-01 UTC--2025-01-01 UTC\n\nYou could then divide it by days() to find out how many days fit in the year:\n\ny2023 / days(1)\n#> [1] 365\ny2024 / days(1)\n#> [1] 366\n\n\n17.4.4 Exercises\n\nExplain days(!overnight) and days(overnight) to someone who has just started learning R. What is the key fact you need to know?\nCreate a vector of dates giving the first day of every month in 2015. Create a vector of dates giving the first day of every month in the current year.\nWrite a function that given your birthday (as a date), returns how old you are in years.\nWhy can’t (today() %--% (today() + years(1))) / months(1) work?", "crumbs": [ "✅ Transformar", "17  Dates and times" @@ -2419,7 +2419,7 @@ "href": "iteration.html#saving-multiple-outputs", "title": "26  Iteration", "section": "\n26.4 Saving multiple outputs", - "text": "26.4 Saving multiple outputs\nIn the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll now explore sort of the opposite problem: how can you take one or more R objects and save it to one or more files? 
We’ll explore this challenge using three examples:\n\nSaving multiple data frames into one database.\nSaving multiple data frames into multiple .csv files.\nSaving multiple plots to multiple .png files.\n\n\n26.4.1 Writing to a database\nSometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv). One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.\nIf you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s duckdb_read_csv():\n\ncon <- DBI::dbConnect(duckdb::duckdb())\nduckdb::duckdb_read_csv(con, \"gapminder\", paths)\n\nThis would work well here, but we don’t have csv files, instead we have excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have one function that will load them all in.\nWe need to start by creating a table that will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:\n\ntemplate <- readxl::read_excel(paths[[1]])\ntemplate$year <- 1952\ntemplate\n#> # A tibble: 142 × 6\n#> country continent lifeExp pop gdpPercap year\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779. 1952\n#> 2 Albania Europe 55.2 1282697 1601. 1952\n#> 3 Algeria Africa 43.1 9279525 2449. 1952\n#> 4 Angola Africa 30.0 4232095 3521. 1952\n#> 5 Argentina Americas 62.5 17876956 5911. 1952\n#> 6 Australia Oceania 69.1 8691212 10040. 
1952\n#> # ℹ 136 more rows\n\nNow we can connect to the database, and use DBI::dbCreateTable() to turn our template into a database table:\n\ncon <- DBI::dbConnect(duckdb::duckdb())\nDBI::dbCreateTable(con, \"gapminder\", template)\n\ndbCreateTable() doesn’t use the data in template, just the variable names and types. So if we inspect the gapminder table now you’ll see that it’s empty but it has the variables we need with the types we expect:\n\ncon |> tbl(\"gapminder\")\n#> # Source: table<gapminder> [0 x 6]\n#> # Database: DuckDB v0.10.0 [root@Darwin 21.6.0:R 4.3.3/:memory:]\n#> # ℹ 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,\n#> # gdpPercap <dbl>, year <dbl>\n\nNext, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():\n\nappend_file <- function(path) {\n df <- readxl::read_excel(path)\n df$year <- parse_number(basename(path))\n \n DBI::dbAppendTable(con, \"gapminder\", df)\n}\n\nNow we need to call append_file() once for each element of paths. That’s certainly possible with map():\n\npaths |> map(append_file)\n\nBut we don’t care about the output of append_file(), so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:\n\npaths |> walk(append_file)\n\nNow we can see if we have all the data in our table:\n\ncon |> \n tbl(\"gapminder\") |> \n count(year)\n#> # Source: SQL [?? x 2]\n#> # Database: DuckDB v0.10.0 [root@Darwin 21.6.0:R 4.3.3/:memory:]\n#> year n\n#> <dbl> <dbl>\n#> 1 1977 142\n#> 2 1987 142\n#> 3 1967 142\n#> 4 2007 142\n#> 5 1952 142\n#> 6 1962 142\n#> # ℹ more rows\n\n\n26.4.2 Writing csv files\nThe same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds data and save one csv file for each clarity. 
First we need to make those individual datasets. There are many ways you could do that, but there’s one way we particularly like: group_nest().\n\nby_clarity <- diamonds |> \n group_nest(clarity)\n\nby_clarity\n#> # A tibble: 8 × 2\n#> clarity data\n#> <ord> <list<tibble[,9]>>\n#> 1 I1 [741 × 9]\n#> 2 SI2 [9,194 × 9]\n#> 3 SI1 [13,065 × 9]\n#> 4 VS2 [12,258 × 9]\n#> 5 VS1 [8,171 × 9]\n#> 6 VVS2 [5,066 × 9]\n#> # ℹ 2 more rows\n\nThis gives us a new tibble with eight rows and two columns. clarity is our grouping variable and data is a list-column containing one tibble for each unique value of clarity:\n\nby_clarity$data[[1]]\n#> # A tibble: 741 × 9\n#> carat cut color depth table price x y z\n#> <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#> 1 0.32 Premium E 60.9 58 345 4.38 4.42 2.68\n#> 2 1.17 Very Good J 60.2 61 2774 6.83 6.9 4.13\n#> 3 1.01 Premium F 61.8 60 2781 6.39 6.36 3.94\n#> 4 1.01 Fair E 64.5 58 2788 6.29 6.21 4.03\n#> 5 0.96 Ideal F 60.7 55 2801 6.37 6.41 3.88\n#> 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4 \n#> # ℹ 735 more rows\n\nWhile we’re here, let’s create a column that gives the name of the output file, using mutate() and str_glue():\n\nby_clarity <- by_clarity |> \n mutate(path = str_glue(\"diamonds-{clarity}.csv\"))\n\nby_clarity\n#> # A tibble: 8 × 3\n#> clarity data path \n#> <ord> <list<tibble[,9]>> <glue> \n#> 1 I1 [741 × 9] diamonds-I1.csv \n#> 2 SI2 [9,194 × 9] diamonds-SI2.csv \n#> 3 SI1 [13,065 × 9] diamonds-SI1.csv \n#> 4 VS2 [12,258 × 9] diamonds-VS2.csv \n#> 5 VS1 [8,171 × 9] diamonds-VS1.csv \n#> 6 VVS2 [5,066 × 9] diamonds-VVS2.csv\n#> # ℹ 2 more rows\n\nSo if we were going to save these data frames by hand, we might write something like:\n\nwrite_csv(by_clarity$data[[1]], by_clarity$path[[1]])\nwrite_csv(by_clarity$data[[2]], by_clarity$path[[2]])\nwrite_csv(by_clarity$data[[3]], by_clarity$path[[3]])\n...\nwrite_csv(by_clarity$data[[8]], by_clarity$path[[8]])\n\nThis is a little different to our previous uses of map()
because there are two arguments that are changing, not just one. That means we need a new function: map2(), which varies both the first and second arguments. And because we again don’t care about the output, we want walk2() rather than map2(). That gives us:\n\nwalk2(by_clarity$data, by_clarity$path, write_csv)\n\n\n26.4.3 Saving plots\nWe can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:\n\ncarat_histogram <- function(df) {\n ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1) \n}\n\ncarat_histogram(by_clarity$data[[1]])\n\n\n\n\n\n\n\nNow we can use map() to create a list of many plots7 and their eventual file paths:\n\nby_clarity <- by_clarity |> \n mutate(\n plot = map(data, carat_histogram),\n path = str_glue(\"clarity-{clarity}.png\")\n )\n\nThen use walk2() with ggsave() to save each plot:\n\nwalk2(\n by_clarity$path,\n by_clarity$plot,\n \\(path, plot) ggsave(path, plot, width = 6, height = 6)\n)\n\nThis is shorthand for:\n\nggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)\nggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)\nggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)\n...\nggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)", + "text": "26.4 Saving multiple outputs\nIn the last section, you learned about map(), which is useful for reading multiple files into a single object. In this section, we’ll now explore sort of the opposite problem: how can you take one or more R objects and save it to one or more files? We’ll explore this challenge using three examples:\n\nSaving multiple data frames into one database.\nSaving multiple data frames into multiple .csv files.\nSaving multiple plots to multiple .png files.\n\n\n26.4.1 Writing to a database\nSometimes when working with many files at once, it’s not possible to fit all your data into memory at once, and you can’t do map(files, read_csv). 
One approach to deal with this problem is to load your data into a database so you can access just the bits you need with dbplyr.\nIf you’re lucky, the database package you’re using will provide a handy function that takes a vector of paths and loads them all into the database. This is the case with duckdb’s duckdb_read_csv():\n\ncon <- DBI::dbConnect(duckdb::duckdb())\nduckdb::duckdb_read_csv(con, \"gapminder\", paths)\n\nThis would work well here, but we don’t have csv files; instead we have Excel spreadsheets. So we’re going to have to do it “by hand”. Learning to do it by hand will also help you when you have a bunch of csvs and the database that you’re working with doesn’t have one function that will load them all in.\nWe need to start by creating a table that we will fill in with data. The easiest way to do this is by creating a template, a dummy data frame that contains all the columns we want, but only a sampling of the data. For the gapminder data, we can make that template by reading a single file and adding the year to it:\n\ntemplate <- readxl::read_excel(paths[[1]])\ntemplate$year <- 1952\ntemplate\n#> # A tibble: 142 × 6\n#> country continent lifeExp pop gdpPercap year\n#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>\n#> 1 Afghanistan Asia 28.8 8425333 779. 1952\n#> 2 Albania Europe 55.2 1282697 1601. 1952\n#> 3 Algeria Africa 43.1 9279525 2449. 1952\n#> 4 Angola Africa 30.0 4232095 3521. 1952\n#> 5 Argentina Americas 62.5 17876956 5911. 1952\n#> 6 Australia Oceania 69.1 8691212 10040. 1952\n#> # ℹ 136 more rows\n\nNow we can connect to the database, and use DBI::dbCreateTable() to turn our template into a database table:\n\ncon <- DBI::dbConnect(duckdb::duckdb())\nDBI::dbCreateTable(con, \"gapminder\", template)\n\ndbCreateTable() doesn’t use the data in template, just the variable names and types. 
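Above, the template's year is hard-coded (template$year <- 1952) because each file's year lives in its name rather than in the spreadsheet itself. As a hedged base-R sketch (the path below is hypothetical), pulling the year out of such a file name could look like:

```r
# Hypothetical sketch: recover the year from a gapminder-style file name.
# basename() drops the directory part; sub() keeps only the leading digits.
path <- "gapminder/1952.xlsx"   # hypothetical path
year <- as.numeric(sub("\\D.*$", "", basename(path)))
year
#> [1] 1952
```

readr's parse_number() offers a more general version of the same idea.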
So if we inspect the gapminder table now you’ll see that it’s empty but it has the variables we need with the types we expect:\n\ncon |> tbl(\"gapminder\")\n#> # Source: table<gapminder> [0 x 6]\n#> # Database: DuckDB v0.10.0 [root@Darwin 21.6.0:R 4.3.3/:memory:]\n#> # ℹ 6 variables: country <chr>, continent <chr>, lifeExp <dbl>, pop <dbl>,\n#> # gdpPercap <dbl>, year <dbl>\n\nNext, we need a function that takes a single file path, reads it into R, and adds the result to the gapminder table. We can do that by combining read_excel() with DBI::dbAppendTable():\n\nappend_file <- function(path) {\n df <- readxl::read_excel(path)\n df$year <- parse_number(basename(path))\n \n DBI::dbAppendTable(con, \"gapminder\", df)\n}\n\nNow we need to call append_file() once for each element of paths. That’s certainly possible with map():\n\npaths |> map(append_file)\n\nBut we don’t care about the output of append_file(), so instead of map() it’s slightly nicer to use walk(). walk() does exactly the same thing as map() but throws the output away:\n\npaths |> walk(append_file)\n\nNow we can see if we have all the data in our table:\n\ncon |> \n tbl(\"gapminder\") |> \n count(year)\n#> # Source: SQL [?? x 2]\n#> # Database: DuckDB v0.10.0 [root@Darwin 21.6.0:R 4.3.3/:memory:]\n#> year n\n#> <dbl> <dbl>\n#> 1 1952 142\n#> 2 1957 142\n#> 3 1972 142\n#> 4 1977 142\n#> 5 1987 142\n#> 6 1967 142\n#> # ℹ more rows\n\n\n26.4.2 Writing csv files\nThe same basic principle applies if we want to write multiple csv files, one for each group. Let’s imagine that we want to take the ggplot2::diamonds data and save one csv file for each clarity. First we need to make those individual datasets. 
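One base-R way to sketch making those individual per-clarity datasets is split(); this is only an illustrative analogue (with a tiny hypothetical data frame), since group_nest(), used next, returns a tidier tibble with a list-column rather than a plain list:

```r
# Illustrative analogue only: split() breaks a data frame into a named list
# of data frames, one per value of the grouping variable.
df <- data.frame(
  clarity = c("I1", "SI2", "I1"),
  price   = c(345, 2774, 2781)
)
pieces <- split(df, df$clarity)
names(pieces)       # one element per clarity value
nrow(pieces$I1)     # both I1 rows end up together
```
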
There are many ways you could do that, but there’s one way we particularly like: group_nest().\n\nby_clarity <- diamonds |> \n group_nest(clarity)\n\nby_clarity\n#> # A tibble: 8 × 2\n#> clarity data\n#> <ord> <list<tibble[,9]>>\n#> 1 I1 [741 × 9]\n#> 2 SI2 [9,194 × 9]\n#> 3 SI1 [13,065 × 9]\n#> 4 VS2 [12,258 × 9]\n#> 5 VS1 [8,171 × 9]\n#> 6 VVS2 [5,066 × 9]\n#> # ℹ 2 more rows\n\nThis gives us a new tibble with eight rows and two columns. clarity is our grouping variable and data is a list-column containing one tibble for each unique value of clarity:\n\nby_clarity$data[[1]]\n#> # A tibble: 741 × 9\n#> carat cut color depth table price x y z\n#> <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>\n#> 1 0.32 Premium E 60.9 58 345 4.38 4.42 2.68\n#> 2 1.17 Very Good J 60.2 61 2774 6.83 6.9 4.13\n#> 3 1.01 Premium F 61.8 60 2781 6.39 6.36 3.94\n#> 4 1.01 Fair E 64.5 58 2788 6.29 6.21 4.03\n#> 5 0.96 Ideal F 60.7 55 2801 6.37 6.41 3.88\n#> 6 1.04 Premium G 62.2 58 2801 6.46 6.41 4 \n#> # ℹ 735 more rows\n\nWhile we’re here, let’s create a column that gives the name of the output file, using mutate() and str_glue():\n\nby_clarity <- by_clarity |> \n mutate(path = str_glue(\"diamonds-{clarity}.csv\"))\n\nby_clarity\n#> # A tibble: 8 × 3\n#> clarity data path \n#> <ord> <list<tibble[,9]>> <glue> \n#> 1 I1 [741 × 9] diamonds-I1.csv \n#> 2 SI2 [9,194 × 9] diamonds-SI2.csv \n#> 3 SI1 [13,065 × 9] diamonds-SI1.csv \n#> 4 VS2 [12,258 × 9] diamonds-VS2.csv \n#> 5 VS1 [8,171 × 9] diamonds-VS1.csv \n#> 6 VVS2 [5,066 × 9] diamonds-VVS2.csv\n#> # ℹ 2 more rows\n\nSo if we were going to save these data frames by hand, we might write something like:\n\nwrite_csv(by_clarity$data[[1]], by_clarity$path[[1]])\nwrite_csv(by_clarity$data[[2]], by_clarity$path[[2]])\nwrite_csv(by_clarity$data[[3]], by_clarity$path[[3]])\n...\nwrite_csv(by_clarity$data[[8]], by_clarity$path[[8]])\n\nThis is a little different to our previous uses of map() because there are two arguments that are 
changing, not just one. That means we need a new function: map2(), which varies both the first and second arguments. And because we again don’t care about the output, we want walk2() rather than map2(). That gives us:\n\nwalk2(by_clarity$data, by_clarity$path, write_csv)\n\n\n26.4.3 Saving plots\nWe can take the same basic approach to create many plots. Let’s first make a function that draws the plot we want:\n\ncarat_histogram <- function(df) {\n ggplot(df, aes(x = carat)) + geom_histogram(binwidth = 0.1) \n}\n\ncarat_histogram(by_clarity$data[[1]])\n\n\n\n\n\n\n\nNow we can use map() to create a list of many plots and their eventual file paths:\n\nby_clarity <- by_clarity |> \n mutate(\n plot = map(data, carat_histogram),\n path = str_glue(\"clarity-{clarity}.png\")\n )\n\nThen use walk2() with ggsave() to save each plot:\n\nwalk2(\n by_clarity$path,\n by_clarity$plot,\n \\(path, plot) ggsave(path, plot, width = 6, height = 6)\n)\n\nThis is shorthand for:\n\nggsave(by_clarity$path[[1]], by_clarity$plot[[1]], width = 6, height = 6)\nggsave(by_clarity$path[[2]], by_clarity$plot[[2]], width = 6, height = 6)\nggsave(by_clarity$path[[3]], by_clarity$plot[[3]], width = 6, height = 6)\n...\nggsave(by_clarity$path[[8]], by_clarity$plot[[8]], width = 6, height = 6)",
    "crumbs": [
      "✅ Programar",
      "26  Iteration"

diff --git a/sitemap.xml b/sitemap.xml
index 54ad35c0e..1ccb86aa6 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -26,7 +26,7 @@
 
     https://cienciadedatos.github.io/pt-r4ds/data-transform.html
     
-    2024-07-29T13:29:18.629Z
+    2024-07-29T23:37:59.061Z
   
   
     https://cienciadedatos.github.io/pt-r4ds/workflow-style.html