A helpful trick: the @__DIR__ macro gives us the directory of the current file (the @__FILE__ macro gives us the full file path). In Jupyter it's equivalent to pwd()
@__DIR__
"/Users/blaschke/Developer/hpc-julia/docs/julia for data science/01_data"
Let's point the notebooks to our data source:
data_directory = joinpath(@__DIR__, "..", "..", "..", "exercises", "covid", "data")
"/Users/blaschke/Developer/hpc-julia/docs/julia for data science/01_data/../../../exercises/covid/data"
Let's make a temporary working directory (i.e. a scratch space):
temp_directory = mktempdir()
"/var/folders/gy/fk8y1bkd5b78l0n687jwhzkc0029yh/T/jl_H93q4P"
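As an aside, mktempdir also accepts a function, in which case the directory is deleted automatically when the block exits. A minimal sketch (the file name scratch.txt is just for illustration):

```julia
# The do-block variant cleans up the directory automatically
# once the block returns (or throws):
mktempdir() do dir
    write(joinpath(dir, "scratch.txt"), "temporary data")
    println(readdir(dir))  # the file exists inside the block
end
# here the directory and its contents are already gone
```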
This is really basic stuff -- you won't need to do this often at all, but it's good to see nonetheless.
First, let's make sure that the target directory is empty:
readdir(temp_directory)
String[]
Now we write a list of integers to a.dat in our temp_directory:
file_path = joinpath(temp_directory, "a.dat")
"/var/folders/gy/fk8y1bkd5b78l0n687jwhzkc0029yh/T/jl_H93q4P/a.dat"
a = [4, 2, 3]
3-element Vector{Int64}:
4
2
3
write(file_path, a)
24
The 24 returned by write is the number of bytes written -- three Int64 values at 8 bytes each. There it is:
readdir(temp_directory)
1-element Vector{String}:
"a.dat"
Let's read it:
read(file_path, Int64)
4
Shoot! That only read the first integer! The problem is that read doesn't know where to stop, so it reads exactly one value of the type you told it (Int64). For comparison, let's interpret the whole file as a string:
read(file_path, String)
"\x04\0\0\0\0\0\0\0\x02\0\0\0\0\0\0\0\x03\0\0\0\0\0\0\0"
So to read an array of data, we need a loop that checks whether we've reached the end of the file:
open(file_path, "r") do io
    while !eof(io)
        print(read(io, Int64), ",")
    end
end
4,2,3,
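If you'd rather avoid the loop, one alternative is to read the whole file as a byte vector and reinterpret those bytes as 64-bit integers. A sketch, assuming the file length is a multiple of 8 bytes and was written in the machine's native byte order:

```julia
# read(path) with no type returns a Vector{UInt8};
# reinterpret views those bytes as Int64 values without copying
bytes = read(file_path)
collect(reinterpret(Int64, bytes))
```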
JSON is a really popular format for structured data. It's a bit clunky, but it's human-readable, and pretty much every web API uses it, so we'll just have to live with the clunk. Check out the documentation here: https://github.com/JuliaIO/JSON.jl
using JSON
Let's make a basic structured data object -- e.g. a dictionary:
d = Dict(
"a"=>1,
"b"=>"hello"
)
Dict{String, Any} with 2 entries:
"b" => "hello"
"a" => 1
Which can be encoded as a JSON:
json(d)
"{\"b\":\"hello\",\"a\":1}"
Which is just a string -- so we can read and write it as a string:
json_file_path = joinpath(temp_directory, "d.json")
write(json_file_path, json(d))
19
Which is just a string in a file, so we can read it back as a string and parse it:
d_string = read(json_file_path, String)
JSON.parse(d_string)
Dict{String, Any} with 2 entries:
"b" => "hello"
"a" => 1
CSV is another text-based format, and the de facto standard for sharing small to medium amounts of data. Check out the documentation here: https://csv.juliadata.org/stable/ and https://dataframes.juliadata.org/stable/
using DataFrames
using CSV
CSV is great for tabular data! And DataFrames are (imo) the best way to program against tabular data:
df = DataFrame(name=String[], age=Float64[], coffees=Int64[])
0 rows × 3 columns
| | name | age | coffees |
|---|---|---|---|
| | String | Float64 | Int64 |
Let's start adding data to our dataframe:
push!(df, ("Johannes", 36.5, 10))
1 rows × 3 columns
| | name | age | coffees |
|---|---|---|---|
| | String | Float64 | Int64 |
| 1 | Johannes | 36.5 | 10 |
push!(df, ("Christin", 34.1, 2))
2 rows × 3 columns
| | name | age | coffees |
|---|---|---|---|
| | String | Float64 | Int64 |
| 1 | Johannes | 36.5 | 10 |
| 2 | Christin | 34.1 | 2 |
Which we can now save to disk (again as text):
coffee_file_path = joinpath(temp_directory, "coffee.csv")
CSV.write(coffee_file_path, df)
"/var/folders/gy/fk8y1bkd5b78l0n687jwhzkc0029yh/T/jl_H93q4P/coffee.csv"
readdir(temp_directory)
3-element Vector{String}:
"a.dat"
"coffee.csv"
"d.json"
Let's look at the CSV file's content:
open(joinpath(temp_directory, "coffee.csv")) do io
    for line in readlines(io)
        println(line)
    end
end
name,age,coffees
Johannes,36.5,10
Christin,34.1,2
Loading a DataFrame from disk involves first creating a CSV.File object and then piping it into a DataFrame:
CSV.File(coffee_file_path)
2-element CSV.File:
CSV.Row: (name = "Johannes", age = 36.5, coffees = 10)
CSV.Row: (name = "Christin", age = 34.1, coffees = 2)
CSV.File(coffee_file_path) |> DataFrame
2 rows × 3 columns
| | name | age | coffees |
|---|---|---|---|
| | String15 | Float64 | Int64 |
| 1 | Johannes | 36.5 | 10 |
| 2 | Christin | 34.1 | 2 |
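The pipe through DataFrame is common enough that CSV.jl provides a shortcut: CSV.read takes the sink type as its second argument. A sketch, reusing coffee_file_path from above:

```julia
# Equivalent to CSV.File(coffee_file_path) |> DataFrame, in one call
df2 = CSV.read(coffee_file_path, DataFrame)
```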
Let's read a more hefty data source:
readdir(data_directory)
1-element Vector{String}:
"total-covid-cases-deaths-per-million.csv"
Which we can now read into a single DataFrame:
data = CSV.File(joinpath(data_directory, "total-covid-cases-deaths-per-million.csv")) |> DataFrame
dropmissing!(data) # Data is never perfect!
141,467 rows × 5 columns (omitted printing of 1 columns)
| | Entity | Code | Day | Total confirmed deaths due to COVID-19 per million people |
|---|---|---|---|---|
| | String | String15 | Date | Float64 |
| 1 | Afghanistan | AFG | 2020-03-23 | 0.025 |
| 2 | Afghanistan | AFG | 2020-03-24 | 0.025 |
| 3 | Afghanistan | AFG | 2020-03-25 | 0.025 |
| 4 | Afghanistan | AFG | 2020-03-26 | 0.05 |
| 5 | Afghanistan | AFG | 2020-03-27 | 0.05 |
| 6 | Afghanistan | AFG | 2020-03-28 | 0.05 |
| 7 | Afghanistan | AFG | 2020-03-29 | 0.1 |
| 8 | Afghanistan | AFG | 2020-03-30 | 0.1 |
| 9 | Afghanistan | AFG | 2020-03-31 | 0.1 |
| 10 | Afghanistan | AFG | 2020-04-01 | 0.1 |
| 11 | Afghanistan | AFG | 2020-04-02 | 0.1 |
| 12 | Afghanistan | AFG | 2020-04-03 | 0.126 |
| 13 | Afghanistan | AFG | 2020-04-04 | 0.126 |
| 14 | Afghanistan | AFG | 2020-04-05 | 0.176 |
| 15 | Afghanistan | AFG | 2020-04-06 | 0.176 |
| 16 | Afghanistan | AFG | 2020-04-07 | 0.276 |
| 17 | Afghanistan | AFG | 2020-04-08 | 0.351 |
| 18 | Afghanistan | AFG | 2020-04-09 | 0.377 |
| 19 | Afghanistan | AFG | 2020-04-10 | 0.377 |
| 20 | Afghanistan | AFG | 2020-04-11 | 0.377 |
| 21 | Afghanistan | AFG | 2020-04-12 | 0.452 |
| 22 | Afghanistan | AFG | 2020-04-13 | 0.477 |
| 23 | Afghanistan | AFG | 2020-04-14 | 0.552 |
| 24 | Afghanistan | AFG | 2020-04-15 | 0.628 |
| 25 | Afghanistan | AFG | 2020-04-16 | 0.728 |
| 26 | Afghanistan | AFG | 2020-04-17 | 0.753 |
| 27 | Afghanistan | AFG | 2020-04-18 | 0.753 |
| 28 | Afghanistan | AFG | 2020-04-19 | 0.753 |
| 29 | Afghanistan | AFG | 2020-04-20 | 0.828 |
| 30 | Afghanistan | AFG | 2020-04-21 | 0.904 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
describe gives us a high-level overview:
describe(data)
5 rows × 7 columns (omitted printing of 4 columns)
| | variable | mean | min |
|---|---|---|---|
| | Symbol | Union… | Any |
| 1 | Entity | | Afghanistan |
| 2 | Code | | ABW |
| 3 | Day | | 2020-01-22 |
| 4 | Total confirmed deaths due to COVID-19 per million people | 527.4 | 0.001 |
| 5 | Total confirmed cases of COVID-19 per million people | 36548.1 | 0.018 |
Let's say we want to extract a column:
data[:, "Total confirmed cases of COVID-19 per million people"]
141467-element Vector{Float64}:
1.004
1.054
1.858
2.008
2.284
2.661
2.862
2.862
4.167
4.82
5.899
6.753
6.778
⋮
16168.118
16197.272
16197.272
16212.711
16212.711
16230.799
16246.437
16276.32
16276.32
16287.915
16295.005
16302.625
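Note that df[:, col] returns a copy of the column. DataFrames offers a few access styles; a quick sketch of the common ones:

```julia
col = "Total confirmed cases of COVID-19 per million people"
data[:, col]   # a copy of the column
data[!, col]   # the underlying vector, no copy (mutating it changes the DataFrame)
data.Entity    # property syntax, for column names that are valid identifiers
```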
Let's extract only those rows matching a country code:
data[data[:, "Code"] .== "USA", :]
760 rows × 5 columns (omitted printing of 1 columns)
| | Entity | Code | Day | Total confirmed deaths due to COVID-19 per million people |
|---|---|---|---|---|
| | String | String15 | Date | Float64 |
| 1 | United States | USA | 2020-02-29 | 0.003 |
| 2 | United States | USA | 2020-03-01 | 0.003 |
| 3 | United States | USA | 2020-03-02 | 0.018 |
| 4 | United States | USA | 2020-03-03 | 0.021 |
| 5 | United States | USA | 2020-03-04 | 0.033 |
| 6 | United States | USA | 2020-03-05 | 0.036 |
| 7 | United States | USA | 2020-03-06 | 0.042 |
| 8 | United States | USA | 2020-03-07 | 0.051 |
| 9 | United States | USA | 2020-03-08 | 0.063 |
| 10 | United States | USA | 2020-03-09 | 0.066 |
| 11 | United States | USA | 2020-03-10 | 0.084 |
| 12 | United States | USA | 2020-03-11 | 0.099 |
| 13 | United States | USA | 2020-03-12 | 0.129 |
| 14 | United States | USA | 2020-03-13 | 0.153 |
| 15 | United States | USA | 2020-03-14 | 0.174 |
| 16 | United States | USA | 2020-03-15 | 0.21 |
| 17 | United States | USA | 2020-03-16 | 0.291 |
| 18 | United States | USA | 2020-03-17 | 0.403 |
| 19 | United States | USA | 2020-03-18 | 0.583 |
| 20 | United States | USA | 2020-03-19 | 0.799 |
| 21 | United States | USA | 2020-03-20 | 1.117 |
| 22 | United States | USA | 2020-03-21 | 1.427 |
| 23 | United States | USA | 2020-03-22 | 1.811 |
| 24 | United States | USA | 2020-03-23 | 2.373 |
| 25 | United States | USA | 2020-03-24 | 3.103 |
| 26 | United States | USA | 2020-03-25 | 4.103 |
| 27 | United States | USA | 2020-03-26 | 5.356 |
| 28 | United States | USA | 2020-03-27 | 6.924 |
| 29 | United States | USA | 2020-03-28 | 9.065 |
| 30 | United States | USA | 2020-03-29 | 10.732 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
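Boolean-mask indexing is one option; filter expresses the same row selection a bit more declaratively. A sketch, reusing data from above:

```julia
# filter keeps the rows for which the predicate returns true;
# each `row` behaves like a named tuple of that row's values
usa = filter(row -> row.Code == "USA", data)
```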