File I/O


Interacting with the Filesystem

A helpful trick: the @__DIR__ macro gives us the directory of the current file (the @__FILE__ macro gives us the full file path). In Jupyter, it's equivalent to pwd().

In [1]:
@__DIR__
Out[1]:
"/Users/blaschke/Developer/hpc-julia/docs/julia for data science/01_data"

Let's point the notebooks to our data source:

In [2]:
data_directory = joinpath(@__DIR__, "..", "..", "..", "exercises", "covid", "data")
Out[2]:
"/Users/blaschke/Developer/hpc-julia/docs/julia for data science/01_data/../../../exercises/covid/data"

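Note that joinpath keeps the ".." segments verbatim, as the output above shows. If you'd rather have a tidy path, normpath collapses them. Here is a minimal sketch using a made-up relative path (the directory names are placeholders, not the ones from this repository):

```julia
# Build a path containing ".." segments, then collapse them with normpath.
# The directory names here are placeholders for illustration only.
raw = joinpath("docs", "notebooks", "..", "..", "exercises", "data")
clean = normpath(raw)

println(raw)    # still contains the ".." segments
println(clean)  # the ".." segments have been resolved
```

Both forms point to the same location, so normalizing is mostly about readability when printing or logging paths.
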
Let's make a temporary working directory (i.e. a scratch space):

In [3]:
temp_directory = mktempdir()
Out[3]:
"/var/folders/gy/fk8y1bkd5b78l0n687jwhzkc0029yh/T/jl_H93q4P"

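If you only need the scratch space for a short while, mktempdir also accepts a do-block and removes the directory (and everything in it) once the block finishes. A small sketch, where the file name scratch.txt is just an example:

```julia
# mktempdir with a do-block: the directory only exists while the block runs.
saved_path = mktempdir() do dir
    # "scratch.txt" is a hypothetical file name used for illustration.
    write(joinpath(dir, "scratch.txt"), "temporary data")
    dir  # return the path so we can inspect it after the block
end

# Check whether the directory survived the block.
println(isdir(saved_path))
```

In this notebook we keep using the plain mktempdir() call from above, since we want the directory to stick around for the following cells.
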
Reading and Writing Files Directly

This is really basic stuff -- you won't need to do this often at all, but it's good to see nonetheless.

First, let's make sure that the target directory is empty:

In [4]:
readdir(temp_directory)
Out[4]:
String[]

Now we write a list of integers to a.dat in our temp_directory:

In [5]:
file_path = joinpath(temp_directory, "a.dat")
Out[5]:
"/var/folders/gy/fk8y1bkd5b78l0n687jwhzkc0029yh/T/jl_H93q4P/a.dat"
In [6]:
a = [4, 2, 3]
Out[6]:
3-element Vector{Int64}:
 4
 2
 3
In [7]:
write(file_path, a)
Out[7]:
24
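
The 24 returned by write is the number of bytes written: three Int64 values at 8 bytes each. A quick sketch to check this for yourself, using a fresh temporary file so that it doesn't interfere with the cells above:

```julia
# write returns the number of bytes written to the file.
a = Int64[4, 2, 3]
path = joinpath(mktempdir(), "a.dat")
nbytes = write(path, a)

# Each Int64 occupies 8 bytes, so we expect length(a) * 8 bytes on disk.
println(nbytes == length(a) * sizeof(Int64))
println(filesize(path) == nbytes)
```

Keep in mind that this is a raw dump of the array's memory: the file stores neither the element type nor the length, so the reader has to know both.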

There it is:

In [56]:
readdir(temp_directory)
Out[56]:
1-element Vector{String}:
 "a.dat"

Let's read it:

In [8]:
read(file_path, Int64)
Out[8]:
4

Shoot! That only read the first integer! The problem is that read doesn't know how much data the file holds, so it reads exactly one value of the type you gave it (Int64). To see everything that's in the file, let's read the whole thing as a String, which shows the 24 raw bytes of our three integers:

In [11]:
read(file_path, String)
Out[11]:
"\x04\0\0\0\0\0\0\0\x02\0\0\0\0\0\0\0\x03\0\0\0\0\0\0\0"

So to read an array of data, we need a loop that checks if we've reached the end of the file:

In [12]:
open(file_path, "r") do io
    while !eof(io)
        print(read(io, Int64), ",")
    end
end
4,2,3,
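
An alternative to the loop, assuming you know the element type, is to work out the number of elements from the file size and let read! fill a preallocated array in one go. This is only a sketch (it writes its own copy of the data to a fresh temporary file):

```julia
# Write three Int64 values to a fresh temporary file.
path = joinpath(mktempdir(), "a.dat")
write(path, Int64[4, 2, 3])

# The file stores no metadata, so we derive the element count from its size.
n = filesize(path) ÷ sizeof(Int64)

# Preallocate the array and let read! fill it from the file.
numbers = Vector{Int64}(undef, n)
read!(path, numbers)

println(numbers)
```

This only works if the file really contains nothing but Int64 values written on a machine with the same byte order.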

JSON

JSON is a really popular format for structured data. It's a bit clunky, but it's human readable, and pretty much every web API uses it, so we'll just have to live with the clunk. Check out the documentation here: https://github.com/JuliaIO/JSON.jl

In [13]:
using JSON

Let's make a basic structured data object -- e.g. a dictionary:

In [16]:
d = Dict(
    "a"=>1,
    "b"=>"hello"
)
Out[16]:
Dict{String, Any} with 2 entries:
  "b" => "hello"
  "a" => 1

Which can be encoded as JSON:

In [17]:
json(d)
Out[17]:
"{\"b\":\"hello\",\"a\":1}"

Which is just a string -- so we can write it to a file like any other string:

In [18]:
json_file_path = joinpath(temp_directory, "d.json")
write(json_file_path, json(d))
Out[18]:
19

The file now just contains that string, so we can read it back as a String and parse it:

In [19]:
d_string = read(json_file_path, String)
JSON.parse(d_string)
Out[19]:
Dict{String, Any} with 2 entries:
  "b" => "hello"
  "a" => 1

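The read-then-parse step can also be done in one call. The sketch below assumes that your installed version of JSON.jl provides JSON.parsefile (check the package documentation if in doubt), and it uses its own temporary file:

```julia
using JSON

d = Dict("a" => 1, "b" => "hello")
path = joinpath(mktempdir(), "d.json")

# Encode the dictionary and write the resulting string to disk.
write(path, JSON.json(d))

# Parse the file directly, without reading it into a String first.
parsed = JSON.parsefile(path)

println(parsed["a"])
println(parsed["b"])
```

Note that JSON only knows a handful of types (numbers, strings, booleans, null, arrays and objects), so what comes back is a generic dictionary-like object rather than whatever Julia type you started with.
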
CSV

CSV is another text-based format that is the de facto standard for sharing small to medium amounts of data. Check out the documentation here: https://csv.juliadata.org/stable/ and https://dataframes.juliadata.org/stable/

In [22]:
using DataFrames
using CSV

CSV is great for tabular data! And DataFrames are (imo) the best way to program against tabular data:

In [23]:
df = DataFrame(name=String[], age=Float64[], coffees=Int64[])
Out[23]:

0 rows × 3 columns

    name      age      coffees
    String    Float64  Int64

Let's start adding data to our dataframe:

In [24]:
push!(df, ("Johannes", 36.5, 10))
Out[24]:

1 rows × 3 columns

    name      age      coffees
    String    Float64  Int64
1   Johannes  36.5     10
In [25]:
push!(df, ("Christin", 34.1, 2))
Out[25]:

2 rows × 3 columns

    name      age      coffees
    String    Float64  Int64
1   Johannes  36.5     10
2   Christin  34.1     2
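
Rows don't have to be plain tuples. As a small sketch, push! also accepts a NamedTuple, in which case the values are matched to columns by name rather than by position (the row below is made up for illustration):

```julia
using DataFrames

df = DataFrame(name=String[], age=Float64[], coffees=Int64[])

# A NamedTuple row: fields are matched by name, so the order doesn't matter.
# The values are hypothetical example data.
push!(df, (coffees=3, name="Ada", age=36.0))

println(df)
```

This tends to be less error-prone than positional tuples once a table has more than a few columns.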

Which we can now save to disk (again as text):

In [26]:
coffee_file_path = joinpath(temp_directory, "coffee.csv")
CSV.write(coffee_file_path, df)
Out[26]:
"/var/folders/gy/fk8y1bkd5b78l0n687jwhzkc0029yh/T/jl_H93q4P/coffee.csv"
In [27]:
readdir(temp_directory)
Out[27]:
3-element Vector{String}:
 "a.dat"
 "coffee.csv"
 "d.json"

Let's look at the CSV file's content:

In [20]:
open(joinpath(temp_directory, "coffee.csv")) do io
    for line in readlines(io)
        println(line)
    end
end
name,age,coffees
Johannes,36.5,10
Christin,34.1,2
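
readlines loads every line into memory before the loop starts. For big files, eachline is a lazier option that hands you one line at a time. A minimal sketch, writing a small stand-in file first:

```julia
# Create a small stand-in CSV file in a fresh temporary directory.
path = joinpath(mktempdir(), "coffee.csv")
write(path, "name,age,coffees\nJohannes,36.5,10\nChristin,34.1,2\n")

# eachline iterates lazily, so only one line is held in memory at a time.
for (i, line) in enumerate(eachline(path))
    println(i, ": ", line)
end
```

For a three-line file the difference is irrelevant, but it starts to matter once files no longer fit comfortably in memory.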

Loading a DataFrame from disk involves first creating a CSV.File object and then piping it into a DataFrame:

In [28]:
CSV.File(coffee_file_path)
Out[28]:
2-element CSV.File:
 CSV.Row: (name = "Johannes", age = 36.5, coffees = 10)
 CSV.Row: (name = "Christin", age = 34.1, coffees = 2)
In [31]:
CSV.File(coffee_file_path) |> DataFrame
Out[31]:

2 rows × 3 columns

    name      age      coffees
    String15  Float64  Int64
1   Johannes  36.5     10
2   Christin  34.1     2
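
Notice that the name column came back as String15 rather than String: CSV.jl picks a compact fixed-width string type for short strings. The sketch below shows CSV.read as a one-call alternative to the pipe, and, assuming a CSV.jl version that supports the stringtype keyword, how to ask for plain String columns instead:

```julia
using CSV, DataFrames

# Write a small table to a fresh temporary file.
path = joinpath(mktempdir(), "coffee.csv")
CSV.write(path, DataFrame(name=["Johannes", "Christin"], age=[36.5, 34.1], coffees=[10, 2]))

# CSV.read takes the sink type as its second argument.
df = CSV.read(path, DataFrame)

# Assumption: the stringtype keyword is available in your CSV.jl version.
df_plain = CSV.read(path, DataFrame; stringtype=String)

println(eltype(df.name))
println(eltype(df_plain.name))
```

Both string types behave the same for comparisons and printing, so you can usually ignore the difference.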

DataFrames

Let's read a heftier data source:

In [30]:
readdir(data_directory)
Out[30]:
1-element Vector{String}:
 "total-covid-cases-deaths-per-million.csv"

Which we can now read into a single DataFrame:

In [34]:
data = CSV.File(joinpath(data_directory, "total-covid-cases-deaths-per-million.csv")) |> DataFrame
dropmissing!(data) # Data is never perfect!
Out[34]:

141,467 rows × 5 columns (omitted printing of 1 columns)

    Entity       Code      Day         Total confirmed deaths due to COVID-19 per million people
    String       String15  Date        Float64
1   Afghanistan  AFG       2020-03-23  0.025
2   Afghanistan  AFG       2020-03-24  0.025
3   Afghanistan  AFG       2020-03-25  0.025
4   Afghanistan  AFG       2020-03-26  0.05
5   Afghanistan  AFG       2020-03-27  0.05
6   Afghanistan  AFG       2020-03-28  0.05
7   Afghanistan  AFG       2020-03-29  0.1
8   Afghanistan  AFG       2020-03-30  0.1
9   Afghanistan  AFG       2020-03-31  0.1
10  Afghanistan  AFG       2020-04-01  0.1
11  Afghanistan  AFG       2020-04-02  0.1
12  Afghanistan  AFG       2020-04-03  0.126
13  Afghanistan  AFG       2020-04-04  0.126
14  Afghanistan  AFG       2020-04-05  0.176
15  Afghanistan  AFG       2020-04-06  0.176
16  Afghanistan  AFG       2020-04-07  0.276
17  Afghanistan  AFG       2020-04-08  0.351
18  Afghanistan  AFG       2020-04-09  0.377
19  Afghanistan  AFG       2020-04-10  0.377
20  Afghanistan  AFG       2020-04-11  0.377
21  Afghanistan  AFG       2020-04-12  0.452
22  Afghanistan  AFG       2020-04-13  0.477
23  Afghanistan  AFG       2020-04-14  0.552
24  Afghanistan  AFG       2020-04-15  0.628
25  Afghanistan  AFG       2020-04-16  0.728
26  Afghanistan  AFG       2020-04-17  0.753
27  Afghanistan  AFG       2020-04-18  0.753
28  Afghanistan  AFG       2020-04-19  0.753
29  Afghanistan  AFG       2020-04-20  0.828
30  Afghanistan  AFG       2020-04-21  0.904
⋮   ⋮            ⋮         ⋮           ⋮
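
To see what dropmissing! actually does, here is a toy example with made-up data (the codes and numbers are placeholders, not taken from the COVID data set). It uses the non-mutating dropmissing so that the original table stays around for comparison:

```julia
using DataFrames

# Hypothetical toy data with gaps in both columns.
df = DataFrame(code=["AAA", missing, "CCC"], cases=[1.0, 2.0, missing])

# Drop every row that has a missing value in any column.
clean = dropmissing(df)

# Only consider the :code column when deciding which rows to drop.
clean_code = dropmissing(df, :code)

println(nrow(df), " -> ", nrow(clean))
println(eltype(clean.cases))
println(nrow(clean_code))
```

By default the cleaned columns also lose the Missing part of their element type, which makes later numerical work a bit more pleasant.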

describe gives us a high-level overview:

In [36]:
describe(data)
Out[36]:

5 rows × 7 columns (omitted printing of 4 columns)

    variable                                                   mean     min
    Symbol                                                     Union…   Any
1   Entity                                                              Afghanistan
2   Code                                                                ABW
3   Day                                                                 2020-01-22
4   Total confirmed deaths due to COVID-19 per million people  527.4    0.001
5   Total confirmed cases of COVID-19 per million people       36548.1  0.018
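
If the default summary is wider than your screen, you can ask describe for just the statistics you care about. A small sketch on a toy table (the column names x and y are placeholders):

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(x=[1.0, 2.0, 3.0], y=[10, 20, 30])

# Request only the mean, minimum and maximum of each column.
stats = describe(df, :mean, :min, :max)

println(stats)
```

The result is itself a DataFrame with one row per column of the input, so you can index into it like any other table.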

Let's say we want to extract a column:

In [35]:
data[:, "Total confirmed cases of COVID-19 per million people"]
Out[35]:
141467-element Vector{Float64}:
     1.004
     1.054
     1.858
     2.008
     2.284
     2.661
     2.862
     2.862
     4.167
     4.82
     5.899
     6.753
     6.778
     ⋮
 16168.118
 16197.272
 16197.272
 16212.711
 16212.711
 16230.799
 16246.437
 16276.32
 16276.32
 16287.915
 16295.005
 16302.625
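
One detail worth knowing: indexing rows with : gives you a copy of the column, while indexing with ! gives you the column vector stored in the DataFrame itself. A toy sketch (the column name x is a placeholder):

```julia
using DataFrames

df = DataFrame(x=[1, 2, 3])

copied = df[:, "x"]   # a fresh copy of the column
direct = df[!, "x"]   # the vector stored inside the DataFrame

copied[1] = 100       # does not affect df
direct[2] = 200       # changes df in place

println(df.x)
```

Use : when you want to be safe from accidental modification, and ! when you want to avoid the copy for large columns.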

Let's extract only those rows matching a country code:

In [37]:
data[data[:, "Code"] .== "USA", :]
Out[37]:

760 rows × 5 columns (omitted printing of 1 columns)

    Entity         Code      Day         Total confirmed deaths due to COVID-19 per million people
    String         String15  Date        Float64
1   United States  USA       2020-02-29  0.003
2   United States  USA       2020-03-01  0.003
3   United States  USA       2020-03-02  0.018
4   United States  USA       2020-03-03  0.021
5   United States  USA       2020-03-04  0.033
6   United States  USA       2020-03-05  0.036
7   United States  USA       2020-03-06  0.042
8   United States  USA       2020-03-07  0.051
9   United States  USA       2020-03-08  0.063
10  United States  USA       2020-03-09  0.066
11  United States  USA       2020-03-10  0.084
12  United States  USA       2020-03-11  0.099
13  United States  USA       2020-03-12  0.129
14  United States  USA       2020-03-13  0.153
15  United States  USA       2020-03-14  0.174
16  United States  USA       2020-03-15  0.21
17  United States  USA       2020-03-16  0.291
18  United States  USA       2020-03-17  0.403
19  United States  USA       2020-03-18  0.583
20  United States  USA       2020-03-19  0.799
21  United States  USA       2020-03-20  1.117
22  United States  USA       2020-03-21  1.427
23  United States  USA       2020-03-22  1.811
24  United States  USA       2020-03-23  2.373
25  United States  USA       2020-03-24  3.103
26  United States  USA       2020-03-25  4.103
27  United States  USA       2020-03-26  5.356
28  United States  USA       2020-03-27  6.924
29  United States  USA       2020-03-28  9.065
30  United States  USA       2020-03-29  10.732
⋮   ⋮              ⋮         ⋮           ⋮
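
The boolean mask used above is one of several ways to select rows. The sketch below compares it with filter and subset on a toy table (the data is made up; only the Code column name mirrors the real data set):

```julia
using DataFrames

# Hypothetical toy data.
df = DataFrame(Code=["USA", "DEU", "USA"], cases=[1.0, 2.0, 3.0])

# 1. Boolean mask, as in the cell above.
by_mask = df[df.Code .== "USA", :]

# 2. filter with a column => predicate pair (applied row by row).
by_filter = filter(:Code => ==("USA"), df)

# 3. subset, where ByRow turns the predicate into a row-wise condition.
by_subset = subset(df, :Code => ByRow(==("USA")))

println(by_mask == by_filter == by_subset)
```

Which one you pick is mostly a matter of taste; the mask is the most explicit, while filter and subset chain more nicely in longer pipelines.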