How to quickly import large datasets into Laravel
Chances are youβve had to import a lot of data from some source, an old legacy codebase for example, into your database. These imports, if not properly optimised, can take a long time and consume a lot of resources. So lets optimise your imports.
Setting everything up
For my imports, I like to use my own Artisan Command
. This way I can simply execute the imports via the console.
However, it can sometimes be useful to implement an import as part of a DatabaseSeeder
.
class DataImportCommand extends Illuminate\Console\Command
{
protected $signature = 'data:import';
public function handle()
{
//...
}
}
The techniques we use here are also usefully outside
DatabseSeeder
,Commands
and Importer in general.
We also need some data. For this test, I simply put a JSON file photos.json
with 20000 entries in my storage
directory.
The structure of the individual entries looks like this:
[
{
"id": 1,
"title": "Quia aliquam quo quod dolor reiciendis nihil. Delectus officia perspiciatis quas aliquam et molestias autem.",
"url": "https://via.placeholder.com/640x480.png/00ee33?text=ullam",
"thumbnailUrl": "https://via.placeholder.com/640x480.png/0022bb?text=repudiandae"
}
...
]
And last but not least, we need a simple eloquent Model and Migration:
return new class extends Illuminate\Database\Migrations\Migration {
public function up()
{
Schema::create('photos', function (Illuminate\Database\Schema\Blueprint $table) {
$table->id();
$table->string('title');
$table->string('url');
$table->string('thumbnail_url');
$table->timestamps();
});
}
};
class Photo extends Illuminate\Database\Eloquent\Model
{
protected $guarded = [];
}
We will use Mass Assignment so we need to add our
$fillable
attributes here or unguard every attribute like I did.
How to measure time and memory usage
βΉοΈ If you do not care about the metrics of your importer you can skip this section
Now that everything is set up, we need a way to measure our success. We want to know how long our Import took and how much
memory we used in the process. Luckily this isnβt a hard problem to solve because Laravel added in one of it last releases a new
Benchmark
class that lets us easily measure the time a functions needs to be resolved. For the Memory consumption PHP has two build in functions
that lets us get the current and peak memory usage, memory_get_peak_usage()
and memory_get_usage()
.
public function run(InputInterface $input, OutputInterface $output): int
{
$return = 0;
$benchmarkResult = Benchmark::measure(
function () use (&$returnValue, $input, $output) {
$return = parent::run($input, $output);
}
);
$timeSpend = $benchmarkResult / 1000 .'s';
$currentMemory = $this->convert(memory_get_usage());
$peakMemory = $this->convert(memory_get_peak_usage());
$this->table(
['time', 'memory_usage', 'memory_peak_usage'],
[$timeSpend, $currentMemory, $peakMemory]
);
return $return;
}
protected function convert(int $size): string
{
$units = ['b', 'kb', 'mb', 'gb', 'tb', 'pb'];
$exponent = floor(log($size, 1024));
return round($size / pow(1024, $exponent), 2) . ' ' . $units[$exponent];
}
To conveniently get the metrics after each run of our import Command we overwrite the run()
method of the command.
Then we crated a callable which calls the parent::run()
and call the new Laravel Illuminate\Support\Benchmark
class
with our callable. The Illuminate\Support\Benchmark
will return the time the callable took to execute back in ms. Because we
want to get the returned value of our parent::run()
as well we need to store it in a variable outside the callable.
To get the current and peak memory usage we simply use the build in php functions memory_get_peak_usage()
and memory_get_usage()
and convert the output to a nicer to read format. Now we put all of this in a table to get a nice formatted console output and return
the value we got from the parent::run()
. If we run the command now it should look like this:
php artisan data:import
+--------------+--------------+-------------------+
| time | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.027612416s | 18.52 mb | 18.57 mb |
+--------------+--------------+-------------------+
The first slow implementation
With all this out of the way letβs start with the first unoptimized implementation
public function handle()
{
$data = json_decode(Storage::get('./photos.json'), true);
$this->withProgressBar(
$data,
fn(array $photo) => Photo::create([
'title' => $photo['title'],
'url' => $photo['url'],
'thumbnail_url' => $photo['thumbnailUrl'],
])
);
}
First of all we start to load our dummy data from our storage/app/photos.json
file and parse the json into an array.
Because we want a nice output in the console we use $this->withProgressBar()
function to loop through our data set. This will create a nice progress bar that gives us realtime feedback on our import status.
In the second parameter we use the static create()
method of our Model to create a new Model instance and save it directly into the
database.
Now letβs see how well our import performs:
php artisan data:import
20000/20000 [ββββββββββββββββββββββββββββ] 100%
+--------------+--------------+-------------------+
| time | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 5.017469125s | 21.69 mb | 43.45 mb |
+--------------+--------------+-------------------+
This is our baseline, letβs improve it.
Making it faster
Our Import took about 5s, not the worst, but we could do better with some small adjustments. First of all we need to identify
the part that takes so long and in this case our offender is the create()
method. If you think about it the create()
method
creates every model on its own with a dedicated SQL call. These communication between database and application take time,
it would be better if we could prepare all the data upfront and then send it with only one single SQL statement into our Database.
public function handle()
{
$data = collect(json_decode(Storage::get('./photos.json'), true));
$bar = $this->output->createProgressBar($data->count());
$bar->start();
$photoData = $data->map(function (array $photo) use ($bar) {
$bar->advance();
return Photo::make([
'title' => $photo['title'],
'url' => $photo['url'],
'thumbnail_url' => $photo['thumbnailUrl'],
])->attributesToArray();
});
Photo::query()->insert($photoData->toArray());
$bar->finish();
}
First of all we need to build up our progress bar manual because now we want to map through our dataset and not just loop
through it. I wonβt explain it here in detail, but you can read all about it here.
Then we convert our data array into a collection with the collect()
helper and map over it.
You could also use the built in
array_map()
function, but I like to work with collections more
Instead of creating the model and immediately saving it into the database we
only instantiate the model with the static make()
method and put it into the collection.
Now comes the trick, we use the insert()
function to put multiple rows of data into the database at once. The insert()
function
expects arrays that contain the attributes as keys and the data as values, so we use the toArray()
method on the Collection
to convert the Collection
and all the Models inside to arrays.
php artisan data:import
20000/20000 [ββββββββββββββββββββββββββββ] 100%
+--------------+--------------+-------------------+
| time | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.661941375s | 29.92 mb | 86.46 mb |
+--------------+--------------+-------------------+
This is a huge time improvement from 5s down to 0.6s! But we see a new problem, our memory usage increased substantially, If we wanted to import even larger datasets this could become a huge problem. So lets fix it.
Fix the Memory creep
Let`s start with a really simple improvement. At the moment we create a collection of 20000 Eloquent Models, but we never really use them and just convert them down to an array later so why not do it immediately convert them into an array which needs way less memory than a complete eloquent model:
public function handle()
{
// ...
$photoData = $data->map(function (array $photo) use ($bar) {
$bar->advance();
return Photo::make([
'title' => $photo['title'],
'url' => $photo['url'],
'thumbnail_url' => $photo['thumbnailUrl'],
])->attributesToArray();
});
//...
}
php artisan data:import
20000/20000 [ββββββββββββββββββββββββββββ] 100%
+--------------+--------------+-------------------+
| time | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.602062459s | 29.69 mb | 74.02 mb |
+--------------+--------------+-------------------+
Instantly we save bout 12mb, not bad for such a small change, but we can do even better! At the moment we process all 20000 entries at once, this does not scale well at some time we will run out of memory. The solution is chunking our input into smaller pieces and process them one by one.
public function handle()
{
//...
$data->chunk(500)
->map(function (Collection $chunk) use ($bar) {
$photoData = $chunk->map(function (array $photo) use ($bar) {
$bar->advance();
return Photo::make([
'title' => $photo['title'],
'url' => $photo['url'],
'thumbnail_url' => $photo['thumbnailUrl'],
])->toArray();
});
Photo::query()->insert($photoData->toArray());
});
//...
}
We use the chunk()
method on the Collection
and then do the same we have done on the big dataset on the smaller chunks.
php artisan data:import
20000/20000 [ββββββββββββββββββββββββββββ] 100%
+--------------+--------------+-------------------+
| time | memory_usage | memory_peak_usage |
+--------------+--------------+-------------------+
| 0.797036583s | 30.2 mb | 43.46 mb |
+--------------+--------------+-------------------+
Now we save additional 31mb and only added 0.1s to our overall time.
You can get probably even more performance out of the Importer if you use
Generators
but that is out of the scope of this Blog Post.
And that is pretty much it. Now you can import huge amounts of data without it taking multiple hours and resources.
Bonus Tipp: Update or Create if not already exist
One quick tipp, if you want your Importer to update an existing model or create a new model if no matching model exists you can
use the upsert()
method instead of insert()
.
Photo::query()
->upsert(
values: $photoData->toArray(),
uniqueBy: ['id'],
update: [
'id',
'title',
'url',
'thumbnail_url'
]
);