116 changes: 116 additions & 0 deletions PROJETO.md
@@ -0,0 +1,116 @@
# Web Scraping - Portal da Transparência

This project was developed as a solution to Agilize's Back End challenge.

The project has two main parts: the scraper, written in Python, and the API, built with Laravel/Lumen 9. SQLite is used as the database.

To test the project, follow the instructions in the *Road Map*. A recording of the application running is also available [HERE](https://www.youtube.com/watch?v=S8d_hGRXP9o).

- [Prerequisites](#prerequisites)
- [Road Map](#road-map)
- [Structure of the .db file](#structure-of-the-government_expensesdb-file)
- [Route](#route)



## Prerequisites

Firefox (this browser is required because the WebDriver uses it to perform the web scraping).

```
https://www.mozilla.org/pt-BR/firefox/new/
```

Python v.3.10.2

```
https://www.python.org/downloads/
```

PHP >= v.8.0

```
https://www.php.net/downloads.php
```

Composer v.2.2.7

```
https://getcomposer.org/download/
```



## Road Map

1. Clone the repository.

2. In the terminal, run:

```
git checkout isaac-jordao
pip install -r requirements.txt
```

3. After the packages are installed, still in the terminal, run the following command:

```
python scraper.py
```



While the scraper runs, it prints messages to the terminal describing what is happening. When it finishes, it creates the file **government_expenses.db** with the records from every page of the Portal da Transparência table. If the file already exists, its data is replaced with the new data.

**Note**: the table on the site sometimes loads slowly. If the second message takes too long to appear in the terminal, rerun the command from step 3.
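
The manual retry described in this note could be replaced by an explicit wait. The sketch below is a minimal, hypothetical helper (`wait_until` is not part of this project or of Selenium) that polls a condition until it holds or a timeout expires. In the scraper, the condition might be a check that the table body already contains rows.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Poll `condition` until it returns a truthy value or `timeout` seconds elapse.

    Hypothetical helper for illustration. Returns True on success and
    False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for "the table has rows": becomes true on the third check.
    checks = {"count": 0}

    def table_loaded() -> bool:
        checks["count"] += 1
        return checks["count"] >= 3

    print(wait_until(table_loaded, timeout=5.0, interval=0.01))
```

In `scraper.py`, a call such as `wait_until(lambda: len(browser.find_elements(By.CSS_SELECTOR, 'tbody tr')) > 0)` could stand in for a fixed `sleep`. The selector is an assumption based on the `tbody`/`tr` lookup the script already performs, and the right condition depends on how the page signals that all rows have loaded. Selenium's own `WebDriverWait` offers the same pattern.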



**From this point on, all terminal commands are run inside the *web-scrapping-api* folder.**



4. Use the following command to enter the API folder:

```
cd ./web-scrapping-api
```

5. Use the command below to install all the dependencies needed to run the API:

```
composer install
```

6. To start the server:

```
php -S localhost:8000 -t public
```

7. In Postman, Insomnia, or even the browser, paste the following address to see the web scraping result:

```
http://localhost:8000/api/dados
```



## Structure of the `government_expenses.db` file

    {
        "index": 0,
        "mes_ano": "01/2022",
        "programa_orcamentario": "63000 - Advocacia-Geral da União",
        "acao_orcamentaria": "63000 - Advocacia-Geral da União - Unidades com vínculo direto",
        "valor_empenhado": "3.159.878.346,60",
        "valor_liquidado": "277.492.902,95",
        "valor_pago": "99.768.376,98",
        "valor_restos_a_pagar_pagos": "265.370.842,94"
    }
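
The monetary fields are stored as text in Brazilian number format (`.` for thousands, `,` for decimals), so they need converting before any arithmetic. The following is a minimal sketch, assuming the table is named `expenses` (as in `scraper.py`) and the columns match the record above. `parse_brl` and `load_expenses` are hypothetical helper names, not part of the project.

```python
import sqlite3
from decimal import Decimal

# Columns that hold Brazilian-formatted amounts stored as text.
AMOUNT_COLUMNS = (
    "valor_empenhado",
    "valor_liquidado",
    "valor_pago",
    "valor_restos_a_pagar_pagos",
)


def parse_brl(text: str) -> Decimal:
    """Convert a Brazilian-formatted number such as '3.159.878.346,60'."""
    return Decimal(text.replace(".", "").replace(",", "."))


def load_expenses(connection: sqlite3.Connection) -> list[dict]:
    """Return every row of `expenses` as a dict with Decimal amounts."""
    connection.row_factory = sqlite3.Row
    rows = connection.execute("SELECT * FROM expenses").fetchall()
    records = []
    for row in rows:
        record = dict(row)
        for column in AMOUNT_COLUMNS:
            record[column] = parse_brl(record[column])
        records.append(record)
    return records


if __name__ == "__main__":
    # In-memory stand-in for government_expenses.db with the sample record.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE expenses ("index" INTEGER, mes_ano TEXT, '
        "programa_orcamentario TEXT, acao_orcamentaria TEXT, "
        "valor_empenhado TEXT, valor_liquidado TEXT, valor_pago TEXT, "
        "valor_restos_a_pagar_pagos TEXT)"
    )
    conn.execute(
        "INSERT INTO expenses VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (0, "01/2022", "63000 - Advocacia-Geral da União",
         "63000 - Advocacia-Geral da União - Unidades com vínculo direto",
         "3.159.878.346,60", "277.492.902,95", "99.768.376,98",
         "265.370.842,94"),
    )
    print(load_expenses(conn)[0]["valor_pago"])
```

To read the real file, `sqlite3.connect("government_expenses.db")` would replace the in-memory connection. `Decimal` is used rather than `float` so that the cent values are kept exactly.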



## Route

[GET] /api/dados - Returns all the data in the table.
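
A client could consume this route as in the sketch below, which uses only the Python standard library. It assumes the server from step 6 is listening on `http://localhost:8000` and that the response body is a JSON array of records. `decode_expenses` and `fetch_expenses` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def decode_expenses(body: bytes) -> list:
    """Decode the UTF-8 JSON body returned by the route."""
    return json.loads(body.decode("utf-8"))


def fetch_expenses(base_url: str = "http://localhost:8000") -> list:
    """GET /api/dados from `base_url` and return the decoded records."""
    with urlopen(f"{base_url}/api/dados", timeout=30) as response:
        return decode_expenses(response.read())


if __name__ == "__main__":
    try:
        print(f"{len(fetch_expenses())} records")
    except OSError as error:  # URLError is a subclass of OSError
        print(f"API not reachable: {error}")
```

Keeping the decoding step separate from the network call makes it possible to check the parsing against a saved response without starting the server.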
Binary file added geckodriver
Binary file not shown.
Binary file added geckodriver.exe
Binary file not shown.
294 changes: 294 additions & 0 deletions geckodriver.log

Large diffs are not rendered by default.

Binary file added government_expenses.db
Binary file not shown.
Binary file added requirements.txt
Binary file not shown.
57 changes: 57 additions & 0 deletions scraper.py
@@ -0,0 +1,57 @@
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep
from selenium.webdriver.firefox.options import Options
import pandas as pd
from selenium.webdriver.support.ui import Select
from sqlalchemy import create_engine

# Run Firefox without opening a window.
options = Options()
options.add_argument('--headless')
browser = webdriver.Firefox(options=options)

print('--------Loading the page')
browser.get(
    'https://www.transparencia.gov.br/despesas/orgao?ordenarPor=orgaoSuperior&direcao=asc')
sleep(5)

# Replace the page-size options so that a single page can show every row.
browser.execute_script(
    """document.getElementsByName('lista_length')[0].innerHTML='<OPTION value="1">1 resultado</OPTION><OPTION value="999999999">Todos os resultados</OPTION>';""")
print('--------Running the script that changes the option to show all results in the table')
select = Select(browser.find_element(By.NAME, 'lista_length'))
select.select_by_visible_text('Todos os resultados')
sleep(15)

page_content = browser.page_source
browser.quit()  # fix: the original never closed the headless browser
site = BeautifulSoup(page_content, 'html.parser')

data_info = []

data_table = site.find('tbody')

rows = data_table.find_all('tr')

for row in rows:
    row_info = row.find_all(
        ['span'], attrs={'data-html': 'true'})

    row_info = [info.text for info in row_info]
    row_info.pop(0)  # drop the first span; the next seven are stored below
    month_year = row_info[0]
    superior_agency = row_info[1]
    linked_entity = row_info[2]
    value_committed = row_info[3]  # valor empenhado
    value_settled = row_info[4]    # valor liquidado
    amount_paid = row_info[5]      # valor pago
    carryover_paid = row_info[6]   # valor de restos a pagar pagos

    data_info.append([month_year, superior_agency, linked_entity,
                      value_committed, value_settled, amount_paid, carryover_paid])

engine = create_engine('sqlite:///government_expenses.db', echo=False)

query_data = pd.DataFrame(data_info, columns=[
    'mes_ano', 'programa_orcamentario', 'acao_orcamentaria', 'valor_empenhado', 'valor_liquidado', 'valor_pago', 'valor_restos_a_pagar_pagos'])
# if_exists='replace' drops and recreates the table on every run;
# the DataFrame index is written as the "index" column.
query_data.to_sql('expenses', con=engine, if_exists='replace')
print('--------Database created successfully')
15 changes: 15 additions & 0 deletions web-scrapping-api/.editorconfig
@@ -0,0 +1,15 @@
root = true

[*]
charset = utf-8
end_of_line = lf
insert_final_newline = true
indent_style = space
indent_size = 4
trim_trailing_whitespace = true

[*.md]
trim_trailing_whitespace = false

[*.{yml,yaml}]
indent_size = 2
16 changes: 16 additions & 0 deletions web-scrapping-api/.env.example
@@ -0,0 +1,16 @@
APP_NAME=Lumen
APP_ENV=local
APP_KEY=
APP_DEBUG=true
APP_URL=http://localhost:8000
APP_TIMEZONE=UTC

LOG_CHANNEL=stack
LOG_SLACK_WEBHOOK_URL=

DB_CONNECTION=sqlite
DB_DATABASE=../../government_expenses.db

CACHE_DRIVER=file
QUEUE_CONNECTION=sync

6 changes: 6 additions & 0 deletions web-scrapping-api/.gitignore
@@ -0,0 +1,6 @@
/vendor
/.idea
Homestead.json
Homestead.yaml
.env
.phpunit.result.cache
6 changes: 6 additions & 0 deletions web-scrapping-api/.styleci.yml
@@ -0,0 +1,6 @@
php:
  preset: laravel
  disabled:
    - unused_use
js: true
css: true
24 changes: 24 additions & 0 deletions web-scrapping-api/README.md
@@ -0,0 +1,24 @@
# Lumen PHP Framework

[![Build Status](https://travis-ci.org/laravel/lumen-framework.svg)](https://travis-ci.org/laravel/lumen-framework)
[![Total Downloads](https://img.shields.io/packagist/dt/laravel/lumen-framework)](https://packagist.org/packages/laravel/lumen-framework)
[![Latest Stable Version](https://img.shields.io/packagist/v/laravel/lumen-framework)](https://packagist.org/packages/laravel/lumen-framework)
[![License](https://img.shields.io/packagist/l/laravel/lumen)](https://packagist.org/packages/laravel/lumen-framework)

Laravel Lumen is a stunningly fast PHP micro-framework for building web applications with expressive, elegant syntax. We believe development must be an enjoyable, creative experience to be truly fulfilling. Lumen attempts to take the pain out of development by easing common tasks used in the majority of web projects, such as routing, database abstraction, queueing, and caching.

## Official Documentation

Documentation for the framework can be found on the [Lumen website](https://lumen.laravel.com/docs).

## Contributing

Thank you for considering contributing to Lumen! The contribution guide can be found in the [Laravel documentation](https://laravel.com/docs/contributions).

## Security Vulnerabilities

If you discover a security vulnerability within Lumen, please send an e-mail to Taylor Otwell at taylor@laravel.com. All security vulnerabilities will be promptly addressed.

## License

The Lumen framework is open-sourced software licensed under the [MIT license](https://opensource.org/licenses/MIT).
Empty file.
29 changes: 29 additions & 0 deletions web-scrapping-api/app/Console/Kernel.php
@@ -0,0 +1,29 @@
<?php

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Laravel\Lumen\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    /**
     * The Artisan commands provided by your application.
     *
     * @var array
     */
    protected $commands = [
        //
    ];

    /**
     * Define the application's command schedule.
     *
     * @param  \Illuminate\Console\Scheduling\Schedule  $schedule
     * @return void
     */
    protected function schedule(Schedule $schedule)
    {
        //
    }
}
10 changes: 10 additions & 0 deletions web-scrapping-api/app/Events/Event.php
@@ -0,0 +1,10 @@
<?php

namespace App\Events;

use Illuminate\Queue\SerializesModels;

abstract class Event
{
    use SerializesModels;
}
16 changes: 16 additions & 0 deletions web-scrapping-api/app/Events/ExampleEvent.php
@@ -0,0 +1,16 @@
<?php

namespace App\Events;

class ExampleEvent extends Event
{
    /**
     * Create a new event instance.
     *
     * @return void
     */
    public function __construct()
    {
        //
    }
}
54 changes: 54 additions & 0 deletions web-scrapping-api/app/Exceptions/Handler.php
@@ -0,0 +1,54 @@
<?php

namespace App\Exceptions;

use Illuminate\Auth\Access\AuthorizationException;
use Illuminate\Database\Eloquent\ModelNotFoundException;
use Illuminate\Validation\ValidationException;
use Laravel\Lumen\Exceptions\Handler as ExceptionHandler;
use Symfony\Component\HttpKernel\Exception\HttpException;
use Throwable;

class Handler extends ExceptionHandler
{
    /**
     * A list of the exception types that should not be reported.
     *
     * @var array
     */
    protected $dontReport = [
        AuthorizationException::class,
        HttpException::class,
        ModelNotFoundException::class,
        ValidationException::class,
    ];

    /**
     * Report or log an exception.
     *
     * This is a great spot to send exceptions to Sentry, Bugsnag, etc.
     *
     * @param  \Throwable  $exception
     * @return void
     *
     * @throws \Exception
     */
    public function report(Throwable $exception)
    {
        parent::report($exception);
    }

    /**
     * Render an exception into an HTTP response.
     *
     * @param  \Illuminate\Http\Request  $request
     * @param  \Throwable  $exception
     * @return \Illuminate\Http\Response|\Illuminate\Http\JsonResponse
     *
     * @throws \Throwable
     */
    public function render($request, Throwable $exception)
    {
        return parent::render($request, $exception);
    }
}
10 changes: 10 additions & 0 deletions web-scrapping-api/app/Http/Controllers/Controller.php
@@ -0,0 +1,10 @@
<?php

namespace App\Http\Controllers;

use Laravel\Lumen\Routing\Controller as BaseController;

class Controller extends BaseController
{
    //
}
16 changes: 16 additions & 0 deletions web-scrapping-api/app/Http/Controllers/ExpensesController.php
@@ -0,0 +1,16 @@
<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use Illuminate\Database\Eloquent\ModelNotFoundException;

use App\Models\Expenses;

class ExpensesController extends Controller
{
    public function index()
    {
        $results = Expenses::get();

        return response()->json($results);
    }
}