Friday, July 18, 2014

Encode XML

Problem

Since XML is very verbose, you are given a way of encoding it where each tag gets mapped to a pre-defined integer value The language/grammar is as follows:

Element –> Element Attr* END Element END [aka, encode the element tag, then its attributes, then tack on an END character, then encode its children, then another end tag]
Attr –> Tag Value [assume all values are strings]
END –> 01
Tag –> some predefined mapping to int
Value –> string value END

Write code to print the encoded version of an xml element (passed in as string).

FOLLOW UP
Is there anything else you could do to (in many cases) compress this even further?

Solution

We all know xml has lots of redendency, of-course I don't want to enter into json vs xml here. Its a compression based algorithm.

private Map<String, Byte> tagMap;
private static final Byte[] END = { 0, 1 };
private List<String> tokens;
private int currentTokenIndex;
 
byte[] encode(char[] input) throws IOException {
    tokenize(input);
    currentTokenIndex = 0;
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    encodeTokens(outputStream);
    return outputStream.toByteArray();
}
 
void encodeTokens(ByteArrayOutputStream output) throws IOException {
    nextToken("<");
    // read tag name
    String tagName = nextToken();
    output.write(getTagCode(tagName));
    // read attributes
    while (!hasNextToken(">") && !hasNextTokens("/", ">")) { 
       // read next attribute
        String key = nextToken();
        nextToken("=");
        String value = nextToken();
        output.write(getTagCode(key));
        for (char c : value.toCharArray()) {
            output.write(c);
        }
        output.write(END[0]);
        output.write(END[1]);
    }
    // end of attributes
    output.write(END[0]);
    output.write(END[1]);
    // finish this element
    if (hasNextTokens("/", ">")) {
        nextToken("/");
        nextToken(">");
    } else {
        nextToken(">");
        // while not the end tag
        while (!hasNextTokens("<", "/")) {
            encodeTokens(output); // encode child
        }
        // ending tag
        nextToken("<");
        nextToken("/");
        nextToken(tagName);
        nextToken(">");
    }
    output.write(END[0]);
    output.write(END[1]);
}

Thanks
References

0 comments:

Post a Comment