Julia can call any function in a shared C library. This requires the library to be built with the -shared and -fPIC flags. The library has to either be in the current working directory or on the system PATH (or, more precisely, on the dynamic linker's search path, which depends on your OS).
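If you want to check up front whether (and where) a library can be found, the Libdl standard library provides a few helpers. A minimal sketch, using a hypothetical library name mylib and symbol myfunc:
using Libdl
# search the given directories first, then the system paths; returns "" if
# no loadable library was found (file names and extensions are OS-dependent)
lib = Libdl.find_library(["mylib"], [pwd()])
handle = Libdl.dlopen(lib)      # load it explicitly (throws if it can't be loaded)
Libdl.dlsym(handle, :myfunc)    # look up a symbol; returns a raw Ptr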
For example, the standard C library libc provides the clock function. Note: the colon is needed in :clock to prevent Julia from trying to evaluate clock as a Julia binding (and subsequently failing because no such function is defined). The colon makes it a symbol:
typeof(:clock)
Symbol
We call the clock function using ccall:
ccall((:clock, "libc"), Int32, ())
12204446
This calls the clock function in libc, expecting an Int32 return type and passing zero arguments (hence the empty argument-type tuple as the last argument). We could wrap this into a function:
f() = ccall((:clock, "libc"), Int32, ())
f (generic function with 1 method)
f()
12896611
Its LLVM intermediate representation and the assembly code show us what's going on:
@code_llvm(f())
; @ In[3]:1 within `f` define i32 @julia_f_1439() #0 { top: %0 = call i32 inttoptr (i64 140703431076467 to i32 ()*)() ret i32 %0 }
@code_native(f())
.section __TEXT,__text,regular,pure_instructions .build_version macos, 12, 0 .globl _julia_f_1481 ## -- Begin function julia_f_1481 .p2align 4, 0x90 _julia_f_1481: ## @julia_f_1481 ; ┌ @ In[3]:1 within `f` .cfi_startproc ## %bb.0: ## %top subq $8, %rsp .cfi_def_cfa_offset 16 movabsq $140703431076467, %rax ## imm = 0x7FF812072A73 callq *%rax popq %rcx retq .cfi_endproc ; └ ## -- End function .subsections_via_symbols
Here we see that we are calling the function at the address of clock (the dynamic linker handles all of this) and returning the result.
More complex data types, such as strings, might not exist in Julia. In the case of strings, Julia provides a Cstring data type, which is binary-compatible with C. Here we also see what happens when we want to pass a single input argument: the argument-type tuple (the second-to-last argument) needs to match the function being called, and the actual arguments are appended after it:
ret = ccall((:getenv, "libc"), Cstring, (Cstring,), "SHELL")
Cstring(0x00007ff7bafc14b9)
This can be converted back into a Julia string using unsafe_string:
unsafe_string(ret)
"/usr/local/bin/fish"
Note that if the function returns NULL (which maps to C_NULL in Julia-land):
ret = ccall((:getenv, "libc"), Cstring, (Cstring,), "asdf")
Cstring(0x0000000000000000)
... then unsafe_string
throws an error:
unsafe_string(ret)
ArgumentError: cannot convert NULL to string Stacktrace: [1] unsafe_string @ ./strings/string.jl:72 [inlined] [2] unsafe_string(s::Cstring) @ Base ./c.jl:193 [3] top-level scope @ In[10]:1 [4] eval @ ./boot.jl:368 [inlined] [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String) @ Base ./loading.jl:1428
This means that we need to guard our ccall whenever this can happen:
function getenv(var::AbstractString)
val = ccall((:getenv, "libc"),
Cstring, (Cstring,), var)
if val == C_NULL
error("getenv: undefined variable: ", var)
end
unsafe_string(val)
end
getenv (generic function with 1 method)
getenv("SHELL")
"/usr/local/bin/fish"
getenv("asdf")
getenv: undefined variable: asdf Stacktrace: [1] error(::String, ::String) @ Base ./error.jl:44 [2] getenv(var::String) @ Main ./In[11]:5 [3] top-level scope @ In[13]:1 [4] eval @ ./boot.jl:368 [inlined] [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String) @ Base ./loading.jl:1428
Let's look at the "Hello World" of shared library: adding 1 to an integer. This is accomplished by the asdf_int
function in asdf.c
. After compiling in place with gcc -shared -fPIC -o asdf_lib.dylib asdf.c
we can call it:
ccall((:asdf_int, "asdf_lib"), Int32, (Int32,), -1)
0
Note that in the example above, I had to use the .dylib
extension because I'm using a mac: https://stackoverflow.com/questions/2339679/what-are-the-differences-between-so-and-dylib-on-osx
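asdf.c itself is not listed here; a minimal version consistent with the description above (add 1 to a 32-bit integer and return it) might look as follows, written as a Julia string in the same style used for the C snippets later in this section:
# hypothetical reconstruction of asdf.c -- the real file is not shown here
asdf_code = """
int asdf_int(int x) {
    return x + 1;
}
"""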
But this example is rather boring. What about sending arrays? Consider the sum function sum(a)
, which computes
$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i,
$$
where $n$ is the length of a
.
a = rand(10^7) # 1D vector of random numbers, uniform on [0,1)
10000000-element Vector{Float64}: 0.8613616678836064 0.6788806822948108 0.47735441120249267 0.9872453487938009 0.0939569660165096 0.9975542610320133 0.9359560690972771 0.13179595010979905 0.3870351677953733 0.30137106295064764 0.0827855699267499 0.05531853697162825 0.9348339001953763 ⋮ 0.7703630608529817 0.785851212129005 0.9956362623582224 0.4591075308839966 0.8398996934654701 0.9995945836053647 0.12824102300712326 0.9404965731289516 0.6715917272434537 0.0025913133378624442 0.5697798774608598 0.9956946163870105
sum(a)
4.998520918420762e6
Let's do this in C (I'm a bit lazy ... so I'll define the C code here rather than in its own file). Note that the Julia function tempname returns a path to a temporary file (where exactly depends on your OS). We can pipe our C code into gcc and write the output into our temporary library; that's what
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end
does. Note also that we're optimising the code, and that we build the output filename with the platform's dynamic-library extension, Libdl.dlext. As a final note: in Julia, string concatenation is done with the multiplication operator *(::AbstractString, ::AbstractString).
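For instance (a small illustration of those last two points; the values shown are what I'd expect on macOS and will differ per OS):
using Libdl
Libdl.dlext                   # the platform's shared-library extension, e.g. "dylib" on macOS or "so" on Linux
tempname()                    # a fresh temporary path, e.g. "/var/folders/.../jl_ABC123"
"mylib" * "." * Libdl.dlext   # string concatenation with *, giving e.g. "mylib.dylib"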
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
s += X[i];
}
return s;
}
"""
const Clib = tempname() # make a temporary file
# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
print(f, C_code)
end
# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(
("c_sum", Clib), Float64,
(Csize_t, Ptr{Float64}), length(X), X
)
c_sum (generic function with 1 method)
Note above the inline function c_sum
which acts as a wrapper for ccall
:
ccall(
("c_sum", Clib), Float64,
(Csize_t, Ptr{Float64}), length(X), X
)
Here we do several things:
- the c_sum function is defined in the temporary library, whose path is stored in Clib
- length returns an integer giving the 1D size of the array X; this integer is mapped to the C size type size_t (which is labelled Csize_t in Julia)
- the array X itself is passed as a pointer to its data, Ptr{Float64}

@code_llvm(c_sum(a))
; @ In[18]:23 within `c_sum` define double @julia_c_sum_2217({}* nonnull align 16 dereferenceable(40) %0) #0 { top: ; ┌ @ array.jl:215 within `length` %1 = bitcast {}* %0 to { i8*, i64, i16, i16, i32 }* %2 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %1, i64 0, i32 1 %3 = load i64, i64* %2, align 8 ; └ ; ┌ @ pointer.jl:65 within `unsafe_convert` %4 = bitcast {}* %0 to i8** %5 = load i8*, i8** %4, align 8 %6 = ptrtoint i8* %5 to i64 ; └ %7 = call double inttoptr (i64 4382146336 to double (i64, i64)*)(i64 %3, i64 %6) ret double %7 }
@code_native(c_sum(a))
.section __TEXT,__text,regular,pure_instructions .build_version macos, 12, 0 .globl _julia_c_sum_2220 ## -- Begin function julia_c_sum_2220 .p2align 4, 0x90 _julia_c_sum_2220: ## @julia_c_sum_2220 ; ┌ @ In[18]:23 within `c_sum` .cfi_startproc ## %bb.0: ## %top subq $8, %rsp .cfi_def_cfa_offset 16 ; │┌ @ pointer.jl:65 within `unsafe_convert` movq (%rdi), %rsi ; │└ ; │┌ @ array.jl:215 within `length` movq 8(%rdi), %rdi movabsq $4382146336, %rax ## imm = 0x105323F20 ; │└ callq *%rax popq %rax retq .cfi_endproc ; └ ## -- End function .subsections_via_symbols
Let's compare the result with Julia's internal sum
function:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol
true
Let's look at some benchmarks comparing c_sum
with Julia's inbuilt sum
:
using BenchmarkTools

c_bench = @benchmark c_sum(a)
BenchmarkTools.Trial: 312 samples with 1 evaluation. Range (min … max): 15.628 ms … 19.708 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 15.808 ms ┊ GC (median): 0.00% Time (mean ± σ): 16.053 ms ± 649.531 μs ┊ GC (mean ± σ): 0.00% ± 0.00% ▃▇█▆▅▄▃ ▂ ███████▇█▇▆█▁▇▄▆▆▄▄▆█▆▄▄▆▁▇▄▁▄▁▆▄▄▄▄▁▄▁▁▁▄▄▄▁▄▄▄▁▁▄▄▁▁▄▁▁▄▄▄ ▇ 15.6 ms Histogram: log(frequency) by time 18.7 ms < Memory estimate: 48 bytes, allocs estimate: 1.
j_bench = @benchmark sum(a)
BenchmarkTools.Trial: 974 samples with 1 evaluation. Range (min … max): 4.411 ms … 10.278 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 4.885 ms ┊ GC (median): 0.00% Time (mean ± σ): 5.122 ms ± 748.226 μs ┊ GC (mean ± σ): 0.00% ± 0.00% ▅█▅▆▂▅▂▄▁▁ ▆██████████▇█▅▆▆▅▄▄▄▅▃▄▄▃▃▃▃▃▃▃▃▂▃▃▂▂▃▃▂▁▂▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂ ▄ 4.41 ms Histogram: frequency by time 7.81 ms < Memory estimate: 16 bytes, allocs estimate: 1.
Oh boy... it seems that C is slower than Julia?! Well, not quite, as we'll see later. The reason the c_sum function is a bit slower than Julia's internal sum is most likely that c_sum relies on dynamically linked code: when a compiler optimizes, it needs to know specifics about how the code is implemented, and a call into a shared library is opaque to it. Furthermore, the ccall itself has a little overhead due to type interoperability.
This is not a Julia-specific overhead... any dynamically linked library call would experience it (so don't go blaming Julia). In the future, inlining may be supported:
https://docs.julialang.org/en/stable/manual/calling-c-and-fortran-code/index.html
Let's implement the sum_f function, which evaluates a Julia function $f$ as it goes:
$$
\mathrm{sum\_f}(a) = \sum_{i=1}^n f(a_i),
$$
where $n$ is the length of a and $f$ is a function defined in Julia.
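As a point of reference (a sketch; the benchmarks below use a broadcast-based version, j_sqrt_sum), a direct Julia implementation would be:
# straightforward Julia version of sum_f: apply f to each element and accumulate
function sum_f(f, a)
    s = 0.0
    for x in a
        s += f(x)
    end
    return s
end
# e.g. sum_f(sqrt, a)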
In order to pass a function pointer correctly, we need to make sure that the Julia function takes and returns C-compatible data types. This is shown in the following example:
function j_sqrt(x::Cdouble)
convert(Cdouble, sqrt(x))::Cdouble
end
j_sqrt (generic function with 1 method)
... which works just like a real square root:
j_sqrt(2.) ≈ √2.
true
... but it also does a little work converting the output of Julia's inbuilt sqrt function to a Cdouble. In order to pass a Julia function to C, it needs to be converted into a function pointer. This has the Julia data type Ptr{Cvoid} (printed as Ptr{Nothing}), but don't worry: when we define our C function, we declare the function pointer's signature with the right types.
const j_sqrt_c = @cfunction(j_sqrt, Cdouble, (Cdouble,))
Ptr{Nothing} @0x000000010793d360
Now we can pass a function pointer to our C function by modifying our c_sum
function call signature:
double c_sqrt_sum(double (* j_sqrt)(double), size_t n, double *X)
where the first argument is a function pointer to a function with call signature double j_sqrt(double). Note that the function on the "other end" of a function pointer is called using the usual j_sqrt(x) call syntax.
C_sqrt_code = """
#include <stddef.h>
double c_sqrt_sum(double (* j_sqrt)(double), size_t n, double *X) {
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
s += j_sqrt(X[i]);
}
return s;
}
"""
const Clib_sqrt = tempname() # make a temporary file
# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib_sqrt * "." * Libdl.dlext) -`, "w") do f
print(f, C_sqrt_code)
end
# define a Julia function that calls the C function:
c_sqrt_sum(X::Array{Float64}) = ccall(
("c_sqrt_sum", Clib_sqrt),
Float64,
(Ptr{Nothing}, Csize_t, Ptr{Float64}),
j_sqrt_c, length(X), X
)
c_sqrt_sum (generic function with 1 method)
Note above that we also modified the call signature to include the function pointer (as well as passing the Julia function pointer j_sqrt_c
).
Let's compare the performance of our C function with a Julia-only implementation:
j_sqrt_sum(X::Array{Float64}) = sum( broadcast( sqrt, X ) )
j_sqrt_sum (generic function with 1 method)
c_sqrt_sum(a) ≈ j_sqrt_sum(a)
true
@benchmark(c_sqrt_sum($a))
BenchmarkTools.Trial: 64 samples with 1 evaluation. Range (min … max): 72.559 ms … 107.022 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 77.179 ms ┊ GC (median): 0.00% Time (mean ± σ): 78.712 ms ± 6.748 ms ┊ GC (mean ± σ): 0.00% ± 0.00% █ █ ▃▃ ▁▃▃ ▇█▆█▄▆▆██▄███▁▁▄▆▁▄▆▁▁▁▁▄▁▄▁▁▄▁▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁ 72.6 ms Histogram: frequency by time 106 ms < Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark(j_sqrt_sum($a))
BenchmarkTools.Trial: 117 samples with 1 evaluation. Range (min … max): 34.171 ms … 56.576 ms ┊ GC (min … max): 0.00% … 26.07% Time (median): 42.539 ms ┊ GC (median): 0.48% Time (mean ± σ): 42.940 ms ± 6.606 ms ┊ GC (mean ± σ): 13.98% ± 12.45% ▂▆ ▂▆▄█▂ ▂ ▂ ▂▂▄ ▂ ████████▆██▄▄▆▁▁▁▄▄▄▁▁▄█▁▁▁▁▁▄▄▆▆█▄███▆▆▆█▄█▆▁█▆▆▆▆▁▆▄▁▆▁▁▄ ▄ 34.2 ms Histogram: frequency by time 55.3 ms < Memory estimate: 76.29 MiB, allocs estimate: 2.
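(An aside, not part of the original comparison: the 76 MiB allocated above comes from broadcast materialising a temporary array before summing; a variant like the sketch below avoids that allocation, though it doesn't change the function-pointer story that follows.)
# allocation-free variant: sum can apply a function to each element directly
j_sqrt_sum_noalloc(X::Array{Float64}) = sum(sqrt, X)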
Hrm... so it seems that we're paying a price for constantly calling through a function pointer, rather than calling a pure C function. Let's verify our theory by implementing a purely C form of sum_f
(i.e. we're not going to be calling the Julia sqrt
function, and instead we'll use the one defined in math.h
).
Cc_sqrt_code = """
#include <stddef.h>
#include <math.h>
double cc_sqrt_sum(size_t n, double *X) {
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
s += sqrt(X[i]);
}
return s;
}
"""
const Clib_c_sqrt = tempname() # make a temporary file
# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):
open(`gcc -fPIC -O3 -msse3 -xc -shared -lm -o $(Clib_c_sqrt * "." * Libdl.dlext) -`, "w") do f
print(f, Cc_sqrt_code)
end
# define a Julia function that calls the C function:
cc_sqrt_sum(X::Array{Float64}) = ccall(
("cc_sqrt_sum", Clib_c_sqrt),
Float64,
(Csize_t, Ptr{Float64}),
length(X), X
)
cc_sqrt_sum (generic function with 1 method)
@benchmark(c_sqrt_sum($a))
BenchmarkTools.Trial: 67 samples with 1 evaluation. Range (min … max): 72.338 ms … 104.730 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 73.986 ms ┊ GC (median): 0.00% Time (mean ± σ): 75.805 ms ± 5.319 ms ┊ GC (mean ± σ): 0.00% ± 0.00% █▁ ██▆▅▇▅▇▄▅▇▄▁▁▁▃▁▁▃▁▃▄▁▃▃▃▁▄▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ ▁ 72.3 ms Histogram: frequency by time 95 ms < Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark(cc_sqrt_sum($a))
BenchmarkTools.Trial: 202 samples with 1 evaluation. Range (min … max): 24.082 ms … 28.802 ms ┊ GC (min … max): 0.00% … 0.00% Time (median): 24.452 ms ┊ GC (median): 0.00% Time (mean ± σ): 24.770 ms ± 833.654 μs ┊ GC (mean ± σ): 0.00% ± 0.00% ▃▄▄█▄▂ ██████▇▇▄▅▃▄▄▃▅▃▃▁▃▅▄▃▁▃▄▃▃▃▃▄▁▃▃▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▃ ▃ 24.1 ms Histogram: frequency by time 28.6 ms < Memory estimate: 0 bytes, allocs estimate: 0.
Ha! Now we're getting a good speed-up. This shows that the performance penalty for Julia functions isn't bad (and, until we have inlining of C functions, there might not always be a speedup from going to C). Yet in some cases, simple compute kernels written entirely in C might improve performance.