Dear Colleagues,
I’d like to introduce BrPoliCorpus, my current project. It is an open dataset of Brazilian political documents comprising 441,644,155 words (about 441.6 million), and it is intended to be useful for linguists and political scientists.
At this stage of the project, the corpus is distributed via an R package or as freely downloadable spreadsheets. The spreadsheets contain the texts together with all the metadata, which provides context for each document. The datasets include:
- Ordinary congress floor sessions
- Ordinary congress committees
- CPI (Parliamentary Inquiry Commission)
- Presidential inaugural speeches
- Government programmes for the offices of Governor and President of the Republic
Here is a breakdown of the corpus's current size:
Doc | Types | Tokens |
---|---|---|
CPI | 2,615,392 | 4,563,382 |
Parliamentary Committees | 7,985,000 | 108,251,624 |
Floor Parliamentary speeches | 3,423,405 | 322,893,136 |
Gov. Programmes | 688,342 | 5,849,807 |
Inaugural Speeches | 31,959 | 86,206 |
Total | 14,744,098 | 441,644,155 |
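For anyone who wants to check the figures, here is a minimal Python sketch that re-adds the rows of the table above and prints each subcorpus's share of the tokens. The variable names are my own illustration and are not part of the R package. Note that the Types total appears to be the plain sum of the per-subcorpus figures, so a word form that occurs in several subcorpora would be counted once in each.

```python
# (types, tokens) per subcorpus, copied from the table above.
counts = {
    "CPI": (2_615_392, 4_563_382),
    "Parliamentary Committees": (7_985_000, 108_251_624),
    "Floor Parliamentary speeches": (3_423_405, 322_893_136),
    "Gov. Programmes": (688_342, 5_849_807),
    "Inaugural Speeches": (31_959, 86_206),
}

# Column sums, to compare against the Total row.
total_types = sum(types for types, _ in counts.values())
total_tokens = sum(tokens for _, tokens in counts.values())

# Fraction of all tokens contributed by each subcorpus.
shares = {name: tokens / total_tokens for name, (_, tokens) in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of tokens")
print(f"Total: {total_types:,} types, {total_tokens:,} tokens")
```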
The corpus is available here.
Please contact me if you need any further information.
All the best,
Rodrigo