...
Introduction
S3 buckets make data easy to access. For those who want to use Python (the recommended route), a few lines of code are enough.
Make sure you have both Python 3 and the access keys to your S3 bucket ready. Typically you'll find your access key and secret key in Morpheus under Tools -> Cypher.
Pre-requisites
Before running any Python code, install the boto3 library:
Code Block |
---|
python3 -m pip install boto3 |
Access private buckets that require S3 credentials
The code segments below show how to access the bucket, list its objects, and upload or download files.
Start by declaring some initial values so that boto3 knows where your bucket is located. Feel free to copy and paste this segment and fill in your own values.
If you're connecting to buckets hosted at the EUMETSAT side of the European Weather Cloud, the endpoint is: https://s3.waw3-1.cloudferro.com
Code Block | ||
---|---|---|
| ||
import os
import io
import boto3
#Initializing some values
project_id = '123' #Fill this in
bucketname = 'MyFancyBucket123' #Fill this in
access_key = '123asdf' #Fill this in
secret_access_key = '123asdf111' #Fill this in
endpoint = 'https://my-s3-endpoint.com' #Fill this in |
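Hard-coded keys are easy to leak when a script is shared or committed. As a minimal sketch of one alternative, the values can be read from environment variables instead. The variable names (`S3_ACCESS_KEY`, `S3_SECRET_KEY`, `S3_ENDPOINT`, `S3_BUCKET`) and the helper `load_s3_settings` are hypothetical names chosen for this example; they are not defined by boto3 or Morpheus.

```python
import os

# Hypothetical variable names chosen for this example; use whatever fits your setup.
REQUIRED_VARS = ("S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_ENDPOINT", "S3_BUCKET")

def load_s3_settings(env=None):
    """Return the S3 settings found in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Report every missing variable at once instead of failing on the first one.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "access_key": env["S3_ACCESS_KEY"],
        "secret_access_key": env["S3_SECRET_KEY"],
        "endpoint": env["S3_ENDPOINT"],
        "bucketname": env["S3_BUCKET"],
    }
```

Calling `settings = load_s3_settings()` would then give you the same values as the variables declared above, for example `settings["endpoint"]` in place of `endpoint`.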
Let's start by initializing the S3 client with our access keys and endpoint:
Code Block | ||
---|---|---|
| ||
#Initialize the S3 client
s3 = boto3.client('s3', endpoint_url=endpoint,
aws_access_key_id = access_key,
aws_secret_access_key = secret_access_key) |
As a first step, and to confirm that we have connected successfully, let's list the objects inside our bucket (up to 1,000 objects).
Code Block | ||
---|---|---|
| ||
#List the objects in our bucket
response = s3.list_objects(Bucket=bucketname)
#An empty bucket returns no 'Contents' key, so fall back to an empty list
for item in response.get('Contents', []):
    print(item['Key']) |
If you want to list more than 1,000 objects in a bucket, you can use a paginator:
Code Block | ||
---|---|---|
| ||
#List objects with a paginator (not limited to 1,000 objects)
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=bucketname)
#Let's store the names of our objects in a list
objects = []
for page in pages:
    #Pages of an empty bucket carry no 'Contents' key
    for obj in page.get('Contents', []):
        objects.append(obj["Key"])
print('Number of objects: ', len(objects)) |
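If only part of a bucket is of interest, the same paginator can be restricted with the `Prefix` parameter of `list_objects_v2`. The helper below is a sketch, not a fixed recipe: the function name `list_keys` is made up for this example, and it expects an already initialized client such as the `s3` object created earlier.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Collect every object key in bucket whose name starts with prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty bucket or an unmatched prefix carry no 'Contents' entry.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

For example, `list_keys(s3, bucketname, "folder1/")` would return only the keys stored under `folder1/`.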
Where an obj looks like this:
Code Block | ||
---|---|---|
| ||
{'Key': 'MyFile.txt', 'LastModified': datetime.datetime(2021, 11, 11, 0, 39, 23, 320000, tzinfo=tzlocal()), 'ETag': '"2e22f62675cea3445f7e24818a4f6ba0d6-1"', 'Size': 1013, 'StorageClass': 'STANDARD'} |
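These entries carry enough metadata to answer simple questions without downloading anything. The sketch below assumes you collected the full entries rather than only their keys; the function name `summarize_objects` and the sample entries are made up for illustration.

```python
from datetime import datetime, timezone

def summarize_objects(objects):
    """Return (count, total_size_in_bytes, newest_key) for a list of listing entries."""
    if not objects:
        return 0, 0, None
    total_size = sum(obj["Size"] for obj in objects)
    # 'LastModified' is a datetime, so max() picks the most recently changed entry.
    newest = max(objects, key=lambda obj: obj["LastModified"])
    return len(objects), total_size, newest["Key"]

# Made-up entries shaped like the one above, for illustration only.
sample = [
    {"Key": "MyFile.txt", "LastModified": datetime(2021, 11, 11, tzinfo=timezone.utc), "Size": 1013},
    {"Key": "Other.txt", "LastModified": datetime(2022, 1, 5, tzinfo=timezone.utc), "Size": 2000},
]
print(summarize_objects(sample))  # (2, 3013, 'Other.txt')
```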
Now let's read a file from the bucket into memory, so we can work with it in Python without ever saving it to the local disk:
Code Block | ||
---|---|---|
| ||
#Read a file into Python's memory and open it as a string
filename = '/folder1/folder2/myfile.txt' #Fill this in
obj = s3.get_object(Bucket=bucketname, Key=filename)
myObject = obj['Body'].read().decode('utf-8')
print(myObject) |
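Decoding to a string suits text files. For binary formats, many libraries expect a file-like object instead, and one possible approach is to wrap the downloaded bytes in an `io.BytesIO` buffer. This is a sketch under that assumption; the helper name `read_object_to_buffer` is hypothetical, and it holds the whole object in memory, so it is only sensible for files that fit there.

```python
import io

def read_object_to_buffer(s3_client, bucket, key):
    """Fetch an object and return its bytes wrapped in a seekable in-memory buffer."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Body is a stream; read() pulls the full content into memory.
    return io.BytesIO(response["Body"].read())
```

With the client from earlier, `buffer = read_object_to_buffer(s3, bucketname, filename)` gives an object that supports `read()` and `seek()` like an opened file.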
If you would rather download the file than read it into memory, here's how to do that:
Code Block | ||
---|---|---|
| ||
#Downloading a file from the bucket
#The first 'myfile' is the local file name, the second is the object key
with open('myfile', 'wb') as f: #Fill this in
    s3.download_fileobj(bucketname, 'myfile', f) |
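When object keys contain folder-like prefixes, it can be convenient to recreate that structure locally. The following sketch uses the client's `download_file` method for this. The helper name `download_to_dir` and its `target_dir` parameter are made up for this example, and it does not guard against unusual keys (for instance ones containing `..`).

```python
import os

def download_to_dir(s3_client, bucket, key, target_dir="."):
    """Download key into target_dir, recreating the key's folder structure locally."""
    # Drop any leading slash so the key is treated as a relative path.
    relative_path = key.lstrip("/")
    local_path = os.path.join(target_dir, *relative_path.split("/"))
    # Create the parent folders before boto3 writes the file.
    os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

For example, `download_to_dir(s3, bucketname, "folder1/folder2/myfile.txt", "data")` would write to `data/folder1/folder2/myfile.txt`.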
Similarly, you can upload files to the bucket, provided you have write access to it:
Code Block | ||
---|---|---|
| ||
#Uploading a file to the bucket (make sure you have write access)
#Arguments: local file name, bucket name, object key; upload_file returns nothing
s3.upload_file('myfile', bucketname, 'myfile') #Fill this in |
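To upload a whole folder rather than a single file, one option is to walk the local directory and call `upload_file` once per file. This is a sketch only: the helper name `upload_directory` and the `key_prefix` parameter are hypothetical, and it uploads files one after another with no retry logic.

```python
import os

def upload_directory(s3_client, local_dir, bucket, key_prefix=""):
    """Upload every file under local_dir, using forward-slash keys below key_prefix."""
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            relative = os.path.relpath(local_path, local_dir)
            # Object keys use "/" regardless of the local path separator.
            key = key_prefix + relative.replace(os.sep, "/")
            s3_client.upload_file(local_path, bucket, key)
            uploaded.append(key)
    return uploaded
```

For example, `upload_directory(s3, "results", bucketname, "run1/")` would store `results/a/b.txt` under the key `run1/a/b.txt`.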
Lastly, you can create a bucket (this can take some time):
Code Block |
---|
s3.create_bucket(Bucket="MyBucket") |
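If a script may run more than once, it can help to check whether the bucket already exists before creating it. The sketch below does this with `list_buckets`; the helper name `ensure_bucket` is made up for this example. Note that `list_buckets` only reports buckets owned by your credentials, so a name already taken by someone else would still make `create_bucket` fail.

```python
def ensure_bucket(s3_client, name):
    """Create the bucket only if it is not already among the buckets we own."""
    existing = [b["Name"] for b in s3_client.list_buckets().get("Buckets", [])]
    if name in existing:
        return False  # Bucket already exists, nothing to do.
    s3_client.create_bucket(Bucket=name)
    return True
```

The return value tells the caller whether a new bucket was created, for example `created = ensure_bucket(s3, "MyBucket")`.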
If you're interested in streaming netCDF files directly from S3 buckets, give these two examples a look:
Code Block | ||
---|---|---|
| ||
import netCDF4 as nc
import xarray as xr
import boto3
import tempfile
from boto3.s3.transfer import TransferConfig
s3 = boto3.client('s3', endpoint_url=endpoint,
                  aws_access_key_id = access_key,
                  aws_secret_access_key = secret_access_key)
#Download the object into a temporary file, then open that file with xarray
tmp = tempfile.NamedTemporaryFile()
tc = TransferConfig(io_chunksize=2621440)
with open(tmp.name, 'wb') as f:
    s3.download_fileobj(bucketname, filename, f, Config=tc)
dataSet = xr.open_dataset(tmp.name, engine='netcdf4')
|
And an alternative and shorter version:
Code Block |
---|
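import h5py
import smart_open
obj_name = 'myfile.nc' #Fill this in (key of the object to open)
#Note: smart_open documents this URI form with a bare host[:port]; if your
#endpoint value includes 'https://', you may need to strip it first
bucketpath = f"s3://{access_key}:{secret_access_key}@{endpoint}@{bucketname}/{obj_name}"
smart_f = smart_open.open(bucketpath, 'rb')
h = h5py.File(smart_f, 'r')
print(h.keys()) |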
If you're interested in more, I recommend this article, which gives a more detailed view of boto3's functionality (it focuses on Amazon Web Services specifically, but the Python code involved is still worth a look):
https://dashbird.io/blog/boto3-aws-python/
Check out a full code example at the official boto3 website:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html
You can also see a differently styled tutorial at:
https://towardsdatascience.com/introduction-to-pythons-boto3-c5ac2a86bb63
Access public buckets (no credentials required)
...
S3 (Simple Storage Service) is a highly scalable, object-based cloud storage service used for storing and retrieving data.
There are different ways of performing actions on files (e.g. upload, download, read, delete) or on the buckets themselves (e.g. create a bucket):
- general-purpose command-line tools (s3cmd, rclone, awscli): S3 from the command line
- language-specific client packages (e.g. the boto3 or s3fs Python libraries): S3 from Python.
...