Bumping version to 0.3.0
igorborgest committed Feb 4, 2020
1 parent f045512 commit ee1809a
Showing 37 changed files with 8,620 additions and 112 deletions.
95 changes: 48 additions & 47 deletions README.md
@@ -2,73 +2,74 @@

> DataFrames on AWS
-[![Release](https://img.shields.io/badge/release-0.2.6-brightgreen.svg)](https://pypi.org/project/awswrangler/)
+[![Release](https://img.shields.io/badge/release-0.3.0-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Downloads](https://img.shields.io/pypi/dm/awswrangler.svg)](https://pypi.org/project/awswrangler/)
[![Python Version](https://img.shields.io/badge/python-3.6%20%7C%203.7-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Documentation Status](https://readthedocs.org/projects/aws-data-wrangler/badge/?version=latest)](https://aws-data-wrangler.readthedocs.io/en/latest/?badge=latest)
[![Coverage](https://img.shields.io/badge/coverage-89%25-brightgreen.svg)](https://pypi.org/project/awswrangler/)
[![Average time to resolve an issue](http://isitmaintained.com/badge/resolution/awslabs/aws-data-wrangler.svg)](http://isitmaintained.com/project/awslabs/aws-data-wrangler "Average time to resolve an issue")
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

-**[Read the Docs!](https://aws-data-wrangler.readthedocs.io)**
+## [Read the Docs](https://aws-data-wrangler.readthedocs.io)

-**[Read the Tutorials](https://github.com/awslabs/aws-data-wrangler/tree/master/tutorials): [Catalog & Metadata](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/catalog_and_metadata.ipynb) | [Athena Nested](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/athena_nested.ipynb) | [S3 Write Modes](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/s3_write_modes.ipynb)**
+## [Read the Tutorials](https://github.com/awslabs/aws-data-wrangler/tree/master/tutorials)
+- [Catalog & Metadata](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/catalog_and_metadata.ipynb)
+- [Athena Nested](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/athena_nested.ipynb)
+- [S3 Write Modes](https://github.com/awslabs/aws-data-wrangler/blob/master/tutorials/s3_write_modes.ipynb)

---

-*Contents:* **[Use Cases](#Use-Cases)** | **[Installation](#Installation)** | **[Examples](#Examples)** | **[Diving Deep](#Diving-Deep)** | **[Step By Step](#Step-By-Step)** | **[Contributing](#Contributing)**

---
+## Contents
+- [Use Cases](#Use-Cases)
+- [Installation](#Installation)
+- [Examples](#Examples)
+- [Diving Deep](#Diving-Deep)
+- [Step By Step](#Step-By-Step)
+- [Contributing](#Contributing)

## Use Cases

### Pandas

-* Pandas -> Parquet (S3) (Parallel)
-* Pandas -> CSV (S3) (Parallel)
-* Pandas -> Glue Catalog Table
-* Pandas -> Athena (Parallel)
-* Pandas -> Redshift (Append/Overwrite/Upsert) (Parallel)
-* Pandas -> Aurora (MySQL/PostgreSQL) (Append/Overwrite) (Via S3) (NEW :star:)
-* Parquet (S3) -> Pandas (Parallel)
-* CSV (S3) -> Pandas (One shot or Batching)
-* Glue Catalog Table -> Pandas (Parallel)
-* Athena -> Pandas (One shot, Batching or Parallel)
-* Redshift -> Pandas (Parallel)
-* CloudWatch Logs Insights -> Pandas
-* Aurora -> Pandas (MySQL) (Via S3) (NEW :star:)
-* Encrypt Pandas Dataframes on S3 with KMS keys
-* Glue Databases Metadata -> Pandas (Jupyter output compatible)
-* Glue Table Metadata -> Pandas (Jupyter output compatible)
+| FROM | TO | Features |
+|------|----|----------|
+| Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes,<br>KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
+| Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Partitions, Parallelism,<br>KMS Encryption, Multiple files |
+| Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines:<br><br>- ctas_approach=False **->** Batching for restricted-memory environments<br>- ctas_approach=True **->** Blazing fast, parallelism and enhanced data types |
+| Pandas DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes<br>Append/Overwrite/Upsert modes |
+| Amazon Redshift | Pandas DataFrame | Blazing fast, using parallel Parquet on S3 behind the scenes |
+| Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL<br>Blazing fast, using parallel CSV on S3 behind the scenes<br>Append/Overwrite modes |
+| Amazon Aurora | Pandas DataFrame | Supported engines: MySQL<br>Blazing fast, using parallel CSV on S3 behind the scenes |
+| CloudWatch Logs Insights | Pandas DataFrame | Query results |
+| Glue Catalog | Pandas DataFrame | List and get table details. Good fit with Jupyter Notebooks. |
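
For context on the Pandas rows above, a minimal sketch of the round trip, assuming the module-level API of the 0.3.x docs (`wr.pandas.to_parquet` and `wr.pandas.read_sql_athena`; the bucket, database, and table names are placeholders):

```python
import pandas as pd
import awswrangler as wr

df = pd.DataFrame({"id": [1, 2], "name": ["foo", "boo"]})

# Pandas DataFrame -> Parquet on S3, with the table registered in the Glue Catalog
wr.pandas.to_parquet(
    dataframe=df,
    database="my_database",           # placeholder Glue database
    path="s3://my-bucket/my_table/",  # placeholder S3 path
    mode="overwrite",
)

# Amazon Athena -> Pandas DataFrame; ctas_approach=True is the "blazing fast"
# engine in the table, ctas_approach=False the batching-friendly one
df2 = wr.pandas.read_sql_athena(
    sql="SELECT * FROM my_table",
    database="my_database",
    ctas_approach=True,
)
```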

### PySpark

-* PySpark -> Redshift (Parallel)
-* Register Glue table from Dataframe stored on S3
-* Flatten nested DataFrames
+| FROM | TO | Features |
+|------|----|----------|
+| PySpark DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes<br>Append/Overwrite/Upsert modes |
+| PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrame on Glue Catalog |
+| Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays in child tables |
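
A sketch of the flatten row, under two assumptions about the 0.3.x Spark module: that `Session` accepts an existing SparkSession via `spark_session`, and that `session.spark.flatten` returns a dict of flat DataFrames:

```python
from pyspark.sql import SparkSession
import awswrangler

spark = SparkSession.builder.appName("wrangler-flatten").getOrCreate()
nested_df = spark.createDataFrame(
    [(1, (1, 2))],
    "id INT, payload STRUCT<a: INT, b: INT>",
)

# Assumption: Session(spark_session=...) and spark.flatten(...) as named here
session = awswrangler.Session(spark_session=spark)

# Nested PySpark DataFrame -> flat PySpark DataFrames (structs flattened,
# arrays broken out into child tables)
dfs = session.spark.flatten(dataframe=nested_df)
for name, flat_df in dfs.items():
    flat_df.printSchema()
```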

### General

-* List S3 objects (Parallel)
-* Delete S3 objects (Parallel)
-* Delete listed S3 objects (Parallel)
-* Delete NOT listed S3 objects (Parallel)
-* Copy listed S3 objects (Parallel)
-* Get the size of S3 objects (Parallel)
-* Get CloudWatch Logs Insights query results
-* Load partitions on Athena/Glue table (repair table)
-* Create EMR cluster (For humans)
-* Terminate EMR cluster
-* Get EMR cluster state
-* Submit EMR step(s) (For humans)
-* Get EMR step state
-* Athena query to receive the result as python primitives (*Iterable[Dict[str, Any]]*)
-* Load and Unzip SageMaker jobs outputs
-* Load and Unzip SageMaker models
-* Redshift -> Parquet (S3)
-* Aurora -> CSV (S3) (MySQL) (NEW :star:)
-* Get Glue Metadata
+| Feature | Details |
+|---------|---------|
+| List S3 objects | e.g. `wr.s3.list_objects("s3://...")` |
+| Delete S3 objects | Parallel |
+| Delete listed S3 objects | Parallel |
+| Delete NOT listed S3 objects | Parallel |
+| Copy listed S3 objects | Parallel |
+| Get the size of S3 objects | Parallel |
+| Get CloudWatch Logs Insights query results | |
+| Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
+| Create EMR cluster | "For humans" |
+| Terminate EMR cluster | "For humans" |
+| Get EMR cluster state | "For humans" |
+| Submit EMR step(s) | "For humans" |
+| Get EMR step state | "For humans" |
+| Query Athena to receive python primitives | Returns *Iterable[Dict[str, Any]]* |
+| Load and Unzip SageMaker jobs outputs | |
+| Dump Amazon Redshift as Parquet files on S3 | |
+| Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine |
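
And a sketch for the general helpers: the `wr.s3.list_objects` call is quoted verbatim in the table; `wr.s3.delete_listed_objects` and `wr.athena.repair_table` are assumed names for the matching rows:

```python
import awswrangler as wr

# List objects under a prefix (parallelized internally), per the table above
keys = wr.s3.list_objects("s3://my-bucket/my-prefix/")
print(f"{len(keys)} objects found")

# Assumed names for the "Delete listed S3 objects" and
# "Load partitions on Athena/Glue table" rows above
wr.s3.delete_listed_objects(objects_paths=keys)
wr.athena.repair_table(database="my_database", table="my_table")
```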

## Installation

2 changes: 1 addition & 1 deletion awswrangler/__version__.py
@@ -1,4 +1,4 @@
__title__ = "awswrangler"
__description__ = "DataFrames on AWS."
__version__ = "0.2.6"
__version__ = "0.3.0"
__license__ = "Apache License 2.0"
2 changes: 1 addition & 1 deletion awswrangler/pandas.py
@@ -831,7 +831,7 @@ def _cast_pandas(dataframe: pd.DataFrame, cast_columns: Dict[str, str]) -> pd.DataFrame:
            elif pandas_type == "date":
                dataframe[col] = pd.to_datetime(dataframe[col]).dt.date.replace(to_replace={pd.NaT: None})
            else:
-               dataframe[col] = dataframe[col].astype(pandas_type, skipna=True)
+               dataframe[col] = dataframe[col].astype(pandas_type)
        return dataframe

@staticmethod
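
The `skipna` removal above tracks pandas itself: pandas 1.0.0 shipped a few days before this commit, and under pandas 1.0+ `Series.astype` accepts only `dtype`, `copy`, and `errors`, so the extra keyword raises `TypeError`. Null-safe integer casts are handled by nullable extension dtypes instead, e.g.:

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# s.astype("int64", skipna=True) -> TypeError under pandas 1.0+
print(s.astype("Int64"))  # nullable Int64 keeps the missing value as <NA>
```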
7 changes: 7 additions & 0 deletions docs/source/api/awswrangler.dynamodb.rst
@@ -0,0 +1,7 @@
+awswrangler.dynamodb module
+===========================
+
+.. automodule:: awswrangler.dynamodb
+    :members:
+    :undoc-members:
+    :show-inheritance:
1 change: 1 addition & 0 deletions docs/source/api/awswrangler.rst
@@ -10,6 +10,7 @@ Submodules
   awswrangler.aurora
   awswrangler.cloudwatchlogs
   awswrangler.data_types
+   awswrangler.dynamodb
   awswrangler.emr
   awswrangler.exceptions
   awswrangler.glue