Generating random alphanumeric profanity-free codes using pthreads in PHP

A friend of mine recently forwarded me an offer he received for the generation of 50 million random codes: alphanumeric, 10 characters long, unique and not containing any profane words. Price tag: same as a brand new mid-range car. Lol. Hold my beer, I’ll do this. 🙂

Not long after I wrote the first line, the SQLite table was filling up with codes in a breeze… until I cranked up the profanity check from a couple of test bad words to a real-world list of around 2,500 of the finest German swear words. That absolutely killed performance, while CPU utilization did not even hit the 30% mark. Being too lazy to rewrite everything in, e.g., Java, I started looking at ways to bring multi-threading to PHP. This led me to pthreads, a project providing multi-threading based on POSIX threads. Motivation follows action, action follows laziness and voilà: the code generator is now able to use all available processing power. Combined with a few tweaks to the bad word dictionary, this dramatically reduced the time needed to finish the job. A test run on my old i7 4something took two and a half hours (using this English profanity list and requiring a theoretical minimum Shannon entropy of 2.2 bits per character).
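To make the two filters mentioned above a bit more concrete, here is a minimal sketch of how a per-code check could look. It is not the code from the project: the function names (shannonEntropy, containsBadWord, isAcceptable) are made up for illustration, and I am assuming a plain case-insensitive substring search against the word list.

```php
<?php
/**
 * Shannon entropy of a string in bits per character,
 * computed from the character frequencies inside that string.
 */
function shannonEntropy(string $code): float
{
    $length = strlen($code);
    if ($length === 0) {
        return 0.0;
    }
    $entropy = 0.0;
    // count_chars(..., 1) returns only the bytes that occur, with their counts.
    foreach (count_chars($code, 1) as $count) {
        $p = $count / $length;
        $entropy -= $p * log($p, 2);
    }
    return $entropy;
}

/** True if any dictionary word occurs in the code (case-insensitive). */
function containsBadWord(string $code, array $badWords): bool
{
    foreach ($badWords as $word) {
        if ($word !== '' && stripos($code, $word) !== false) {
            return true;
        }
    }
    return false;
}

/** Hypothetical combined check: enough entropy and no dictionary hit. */
function isAcceptable(string $code, array $badWords, float $minEntropy = 2.2): bool
{
    return shannonEntropy($code) >= $minEntropy
        && !containsBadWord($code, $badWords);
}
```

With this measure a 10-character code can reach at most log2(10) ≈ 3.32 bits per character (all characters different), so a threshold of 2.2 mostly weeds out codes with heavy repetition. The dictionary loop is also where the time goes: every candidate code is compared against every word in the list.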

The whole project and its output can be downloaded below. Make sure to install pthreads first. The script is configured in Config.php. Also note that pthreads projects can only be run from the CLI.

A few lessons learned:

* Use a language with native multi-threading in the first place when thinking about solving highly repetitive tasks.
* Use random_int() instead of rand(). rand() does not generate cryptographically secure values and will quickly lead you to duplicate codes.
* Create objects that need to be passed into a pthreads worker in the calling context and keep a reference to them. Objects created in a thread-scope constructor will be destroyed to avoid memory issues.
* Combining multiple SQL INSERTs into one transaction takes way less time than inserting rows one by one.
* Having an idea of the statistical probability of hitting a duplicate code or an unwanted word helps balance the effort spent on avoiding them. Keep in mind that every constraint makes a code easier to guess.
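Three of these points (random_int(), batched INSERTs and the duplicate probability) can be sketched in a few lines. This is only an illustration under my own assumptions, not the code from the download: the 36-symbol alphabet, the table and column name (codes, code), the in-memory database and all function names are hypothetical.

```php
<?php
// Assumed alphabet: digits and uppercase letters (36 symbols).
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ';

/** Builds one code by picking each character with random_int(). */
function generateCode(int $length = 10, string $alphabet = ALPHABET): string
{
    $max = strlen($alphabet) - 1;
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code .= $alphabet[random_int(0, $max)];
    }
    return $code;
}

/** Inserts a whole batch of codes inside a single transaction. */
function insertBatch(PDO $db, array $codes): void
{
    // INSERT OR IGNORE lets the PRIMARY KEY silently drop duplicates.
    $stmt = $db->prepare('INSERT OR IGNORE INTO codes (code) VALUES (?)');
    $db->beginTransaction();
    try {
        foreach ($codes as $code) {
            $stmt->execute([$code]);
        }
        $db->commit();
    } catch (Throwable $e) {
        $db->rollBack();
        throw $e;
    }
}

/** Birthday approximation: expected duplicate pairs among $n uniform codes. */
function expectedDuplicatePairs(int $n, int $alphabetSize, int $length): float
{
    $space = (float) ($alphabetSize ** $length);
    return ((float) $n * $n) / (2.0 * $space);
}

// In-memory SQLite database, used here only to keep the sketch self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE codes (code TEXT PRIMARY KEY)');

$batch = [];
for ($i = 0; $i < 1000; $i++) {
    $batch[] = generateCode();
}
insertBatch($db, $batch);

$stored = (int) $db->query('SELECT COUNT(*) FROM codes')->fetchColumn();
```

Under the 36-symbol assumption, expectedDuplicatePairs(50000000, 36, 10) comes out at roughly 0.34, so with a uniform generator and no filters you would expect less than one duplicate pair in the whole run. Every additional constraint shrinks the code space and pushes that number up.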

DOWNLOAD: CodeGenerator Project
DOWNLOAD: 50 million codes (1.1 GB, zipped)
