There are two configurations: the "normal" image, which has 12 people, and the "modified" image, which has 13 people.
In the normal image, each person has a "top" and a "bottom". For each of the 12 people, the number of pixels in the bottom and top parts is as follows:
Now, the pieces are rearranged as follows to form 13 people:
(NPERSON is the number of the new person, BOTTOM is the PERSON number whose bottom is used, and TOP is the PERSON number whose top is used.)
As you can see, the bottom of 1 is used to form a whole person (since the top of that person is an insignificant graphic) and the top of 10 is used to form a whole person (since the bottom of that person is an insignificant graphic). Each left-over bottom and top is then matched up with the person who has the next least significant bottom or top.
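Area (the total pixel count) is conserved because the people in the 13-person configuration are shorter: on average each is about 12/13 as tall as before, assuming the widths don't change. This is not noticed because the drawings are ugly anyhow.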
It would be a lot simpler to understand if the people were sorted from "most bottom pixels and fewest top pixels" to "fewest bottom pixels and most top pixels". In that case, instead of a complicated swap between two areas, you'd just shift the top part one position to the right, which would take the top 5% off the leftmost guy and the top 95% off the rightmost guy. That extra 95% of the rightmost guy would look like a guy of his own (and the leftmost guy would still look like a guy).
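To make the sorted version concrete, here is a rough Python sketch of the shift. It is only a model, not a measurement of the actual image: it assumes each person has height 1, and that the cut lines are spaced evenly so the top part grows from 5% on the leftmost person to 95% on the rightmost. The names shift_tops, tops and heights are made up for this illustration.

```python
from fractions import Fraction


def shift_tops(tops):
    """Shift every top one slot to the right and return the new heights.

    tops[i] is the fraction of person i that lies above the cut line;
    the bottom of person i is therefore 1 - tops[i].
    """
    n = len(tops)
    bottoms = [1 - t for t in tops]
    new_heights = []
    for slot in range(n + 1):
        # The last slot has no bottom under it; the first slot has no top over it.
        bottom = bottoms[slot] if slot < n else Fraction(0)
        top = tops[slot - 1] if slot > 0 else Fraction(0)
        new_heights.append(bottom + top)
    return new_heights


# Assumption: 12 people, top fractions evenly spaced from 5% to 95%.
n = 12
first, last = Fraction(5, 100), Fraction(95, 100)
step = (last - first) / (n - 1)
tops = [first + i * step for i in range(n)]

heights = shift_tops(tops)

for slot, h in enumerate(heights, start=1):
    print(f"new person {slot}: height {float(h):.3f}")
print("total height:", sum(heights))
```

Under these assumptions the shift produces 13 people from 12. The two end people are 95% of full height, each of the 11 people in between loses one step (9/110 of full height), and the heights still add up to exactly 12, so nothing has been created or lost.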