While it is fine to run a SINGLE microphone at 16 KHz to capture speech, if you intend to do any processing of the microphone ARRAY, then a higher rate will yield dramatically improved array functionality.
For that reason, I recommend running the array at a 48 KHz sampling rate.
I’m also assuming that all microphone array processing would be done on the FPGA: The RasPi would only need to handle a single stream filtered to the voice band.
There are other reasons to sample faster than 16 KHz, mostly to detect and reject both noise and echoes. Without external acoustic filtering to isolate the speech bands, the Nyquist-Shannon Sampling Theorem says that frequencies above and below the sample bandwidth will wrap into the sample bandwidth (aliasing). The wider you sample, the easier it will be to remove non-speech sounds (music, noise, etc.).
The best approach is to sample fast, perform direction finding on the raw data (after optional echo suppression), de-correlate to get isolated beams, then filter to yield the desired voice band, which would then be fed to a local wake-phrase recognizer and/or a cloud or local continuous speech recognizer.
There is no need for the RasPi to have to deal with anything more than a single 16 KHz sound channel coming out of MATRIX Creator FPGA (unless the user wants to, of course), but the MATRIX Creator FPGA will internally need a 48 KHz channel from each microphone to do decent microphone array processing.
Another way to look at it is by inspecting the relative physical dimensions of the array and a sound sample. The array is about 10 cm in diameter. At 48 KHz a sound sample is about .7 cm long (speed of sound / sample rate). That’s about 1/14 of the diameter, very useful for angular determination and inverse beamforming.
But at 16 KHz a sample would be 4 times longer, about 1/3 of the diameter, making beam processing very much sloppier. The extracted beams would contain sound from a much wider cone, leading to more speech recognition failures.