Text Parsing

By profession, I am a qualified lecturer in Chemistry. I've been doing it for years, and one thing that is key to the subject is a solid grasp of concentration and the mole. To most people, a mole is a cute little creature that digs up people's lawns and that's about it. To a chemist, it's possibly one of the most fundamental ideas, and it causes more problems than enough.

After teaching one particular group, I decided to sit down and write a chemical formula calculator. I know there are lots of these on the market, but there was a problem: some charge, some don't and some aren't very good. Key to mine would be a simple-to-use interface.

The user interface

The user would enter the formula and then the number of moles required. The program would do what it needed to and report back its answers. Easy enough.

As with any piece of software, the majority of the effort is in the design and in understanding the problem.

The problem

Take the formula CuSO4.5H2O - this is a compound called copper(II) sulfate pentahydrate. If you did science at school, you'll possibly have grown crystals of it, as it has a very attractive deep blue colour.

A chemist would look at this and work out how much it weighs like this:

      There is a dot, which means that there are two parts to the formula.
      On the left, there is a Cu, an S and an O followed by a 4. As there are no numbers after the Cu and the S, there is only one of each.
      On the right, there is an H followed by a 2 and then an O with nothing after it.
      Before the H, though, there is a 5. This means that there are 5 lots of whatever follows, so there are now 10 lots of H and 5 lots of O.
      Overall, then, there is 1 Cu, 1 S, 9 O and 10 H.
      Find the mass of each element, multiply it by the number of atoms of that element and add the lot together.
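
The last step is plain arithmetic, and it can be sketched in a few lines of Python. This is only an illustration: the atomic masses are rounded values I have assumed, and the names ATOMIC_MASS and molar_mass are made up for this example.

```python
# Rounded atomic masses in g/mol (assumed values for this sketch).
ATOMIC_MASS = {"Cu": 63.546, "S": 32.06, "O": 15.999, "H": 1.008}

# Atom counts for CuSO4.5H2O, as worked out by hand in the steps above.
counts = {"Cu": 1, "S": 1, "O": 9, "H": 10}


def molar_mass(counts):
    """Multiply each element's mass by its count and add the lot together."""
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())


mass = molar_mass(counts)
print(f"CuSO4.5H2O: {mass:.2f} g/mol")  # prints "CuSO4.5H2O: 249.68 g/mol"
```

Multiplying that figure by the number of moles the user asks for gives the mass to weigh out.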

This sort of calculation can be performed in a matter of moments by any chemistry student. In fact, a non-chemist can do it as long as they know what each element weighs; it's just simple maths.

A program though has to handle things differently...

For a start, elements always start with a capital letter, but they don't have to be just one letter (Cu, for example, has two). Brackets also have to be understood, and what happens if there is a dot at the end?
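
To make those rules concrete, here is one possible sketch in Python. It is not necessarily how the calculator described here was written. The names TOKEN, parse_part and parse_formula are hypothetical, and two behaviours are my own assumptions: a trailing dot is simply ignored, and symbols are not checked against the periodic table.

```python
import re
from collections import Counter

# A token is an element symbol (one capital, then optional lowercase letters),
# a run of digits, or a bracket.
TOKEN = re.compile(r"[A-Z][a-z]*|\d+|[()]")


def parse_part(part):
    """Count the atoms in one dot-free part, e.g. '5H2O' or 'Ca(OH)2'."""
    tokens = TOKEN.findall(part)
    if "".join(tokens) != part:
        raise ValueError(f"unexpected character in {part!r}")

    # A leading number multiplies everything that follows it.
    multiplier = 1
    if tokens and tokens[0].isdigit():
        multiplier = int(tokens.pop(0))

    stack = [Counter()]  # one Counter per open bracket level
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        i += 1
        if tok == "(":
            stack.append(Counter())
            continue
        if tok == ")":
            if len(stack) == 1:
                raise ValueError("unmatched ')'")
            group = stack.pop()
        elif tok.isdigit():
            raise ValueError(f"misplaced number in {part!r}")
        else:
            group = Counter({tok: 1})

        # An optional number after a symbol or closing bracket is its count.
        n = 1
        if i < len(tokens) and tokens[i].isdigit():
            n = int(tokens[i])
            i += 1
        for element, count in group.items():
            stack[-1][element] += count * n

    if len(stack) != 1:
        raise ValueError("unmatched '('")
    return Counter({el: c * multiplier for el, c in stack[0].items()})


def parse_formula(formula):
    """Split on dots, parse each part and add the counts together."""
    total = Counter()
    for part in formula.split("."):
        if part:  # assumption: an empty part (trailing dot) is skipped
            total.update(parse_part(part))
    return total


print(dict(parse_formula("CuSO4.5H2O")))
```

Keeping one counter per bracket level means that a group such as (OH)2 is multiplied as a whole when the closing bracket is reached, which is the same thing a chemist does on paper.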

With a bit of logic and quite a bit of paper, the solution wasn't that hard...

Programming the solution