With the Watson Document Conversion service on Bluemix, PDF, Word and HTML documents can be converted into HTML, plain text or JSON. The converted documents can then be used as input to other Watson services such as Concept Insights and Retrieve and Rank.
In my Concept Insights sample I used the service to convert the downloaded HTML files into JSON. The title and body fields were then extracted from the JSON and uploaded to the Concept Insights service. Check out the Python script convert.py to see how to invoke the service via curl for multiple files. Here is the key part.
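import shlex
import subprocess
# Post the HTML file with curl and request the ANSWER_UNITS JSON format (-k skips TLS certificate checks)
curl_cmd = 'curl -k -s %s -u %s -F "config={\\"conversion_target\\":\\"ANSWER_UNITS\\"}" -F "file=@%s" "%s"' % (VERBOSE, DOCCNV_CREDS, htmlfilename, DOCCNV_CNVURL)
process = subprocess.Popen(shlex.split(curl_cmd), stdout=subprocess.PIPE)
output = process.communicate()[0]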
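The extraction step mentioned above, pulling a title and body out of the returned JSON, could look roughly like the following. This is a minimal sketch and not the code from convert.py. It assumes the ANSWER_UNITS response contains an "answer_units" list whose entries carry a "title" and a "content" list of {"media_type", "text"} items. The function name extract_title_and_body is hypothetical, and so is the choice of the first unit's title as the document title.

```python
import json


def extract_title_and_body(converted_json):
    """Return (title, body) from an ANSWER_UNITS JSON string.

    Assumed response shape (not taken from convert.py):
    {"answer_units": [{"title": ..., "content": [{"media_type": ..., "text": ...}]}]}
    """
    doc = json.loads(converted_json)
    units = doc.get("answer_units", [])

    # Assumption: the first answer unit's title stands in for the document title.
    title = units[0].get("title", "") if units else ""

    # Concatenate the plain-text content of every answer unit into one body.
    parts = []
    for unit in units:
        for item in unit.get("content", []):
            if item.get("media_type") == "text/plain":
                parts.append(item.get("text", ""))
    return title, "\n".join(parts)


if __name__ == "__main__":
    # Hand-written sample that mimics the assumed response shape.
    sample = json.dumps({
        "answer_units": [
            {"title": "Intro", "content": [{"media_type": "text/plain", "text": "Hello"}]},
            {"title": "Details", "content": [{"media_type": "text/plain", "text": "World"}]},
        ]
    })
    title, body = extract_title_and_body(sample)
    print(title)
    print(body)
```

In the script above, the output variable returned by communicate() would be the string to pass to this function.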
Check out the API explorer for samples of how to invoke the service from the command line and from Java, Node and Python. There are also various customization options.
Here is a sample of the online demo.