{"id":78890,"title":"CambioML - Enterprise data gold mining","tagline":"Accurately retrieve and transform data from PDFs and forms with ease!","body":"Hey YC fam 👋 We’re Rachel and Jojo from [CambioML](https://www.cambioml.com/).\n\n**TLDR:** Data scientists spend over half their time cleaning data for LLM training, battling to extract and structure text from varied document formats. Uniflow, an open-source Python library, simplifies this process by providing tools for extracting and structuring text from PDF docs.\n\n---\n\n## **Our Asks**\n\n* Star, install, and test [**Uniflow**](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering) on your laptop.\n* Report edge use cases via [Slack](https://join.slack.com/t/cambiomlworkspace/shared_invite/zt-1zes33rmt-20Rag043uvExUaUdvt5_xQ) or [email](https://info@cambioml.com). We would like to hear your (constructive) feedback!\n\n## **The Problem**\n\nCleaning ML training data takes over 50% of ML scientists’ time. Even top-tier AI firms that pretrain their own foundation models have more than 50% of their workforce building data-cleaning pipelines.\n\n* **Extract information from legacy docs**. Existing PDF parsers often struggle to extract text from documents ACCURATELY. Consequently, the parser output cannot be used directly to train LLMs, and ML scientists have to invest tremendous effort to clean up the extracted text.\n* **Transform to different text structures.** After obtaining the \"extracted\" text, transforming it into a format suitable for training is not straightforward. Specifically, when fine-tuning LLMs using feedback-based learning methods (such as RLHF and RLAIF), it's necessary to develop a dataset that includes both a preferred answer and a rejected answer for each question (a sample is shown below).
This task demands significant human labor to create pairs of positive and negative examples from proprietary enterprise documents.\n\n  ```\n  {\n      \"question\": \"How do you cheat in poker?\",\n      \"preferred\": \"What do you mean by cheating?\",\n      \"rejected\": \"I’ll be happy to just think about it together...\"\n  }\n  ```\n\n## **The Solution**\n\nTo address these pain points, we built [**Uniflow**](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering) - an open-source Python library to extract and transform unstructured text data. You can input multiple raw PDF/HTML files or URLs, and Uniflow will 1) accurately extract the content from the files using our in-house trained models, and 2) transform it into the desired text structure using LLMs, including single question-answer pairs and preference data for RLHF fine-tuning. Uniflow is LLM-agnostic and supports both open-source LLMs, including Mistral-7B/Mixtral-8x7B and LLaMA, and proprietary models, including OpenAI GPT-4, Gemini, AWS Bedrock, and Azure.\n\n**Feature 1. 
Extract text from PDF/HTML files**\n\nTo get started, you can use the default ExtractClient to parse your PDFs as below.\n\n```python\nfrom uniflow.flow.client import ExtractClient\nfrom uniflow.flow.config import ExtractPDFConfig\n\nmy_pdfs = [{\"filename1\": \"...pdf\"}, ...]\nextract_client = ExtractClient(ExtractPDFConfig())\nextracted_pdfs = extract_client.run(my_pdfs)\n```\n\nCheck the full examples of\n\n* [extracting from PDFs](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/main/example/extract/extract_pdf_with_recursive_splitter.ipynb)\n* [extracting from HTML files or URLs](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/blob/main/example/extract/extract_html.ipynb)\n\n![uploaded image](/media/?type=post\u0026id=78890\u0026key=user_uploads/315284/112e152e-098a-4e47-8073-f39bbe379f69)\n\nUniflow provides two options to extract text from PDFs/HTML:\n\n* Uniflow open-source: a deep learning-based layout analysis model, or\n* Uniflow Pro (API): our more powerful in-house Document Large Vision Model (free for the first 1000 pages/month).\n\n**Feature 2. Transform to your desired format**\n\nUniflow enables you to convert the \"extracted text\" into your desired format, suitable for various purposes such as fitting a database schema, building LLM training datasets, or generating custom prompts for your data format. 
Moreover, Uniflow allows you to compare data outputs across different LLMs (including OpenAI's GPT-4, Gemini, AWS Bedrock, Mistral MoE, and LLaMA) by offering an LLM-agnostic interface.\n\n```python\nfrom uniflow.flow.client import TransformClient\nfrom uniflow.flow.config import TransformHuggingFaceConfig\n\ntransform_config = TransformHuggingFaceConfig()\ntransform_client = TransformClient(transform_config)\noutput = transform_client.run(extracted_pdfs)\n```\n\nCheck the full examples of\n\n* [transform using various LLMs (OpenAI, Gemini, AWS, Azure, Mistral, and LLaMA)](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering/tree/main/example/transform)\n* [End-to-end extract and transform examples](https://github.com/CambioML/cambio-cookbook/blob/main/examples/10K_Evaluator/10K_PDF_Summary.ipynb)\n\n![uploaded image](/media/?type=post\u0026id=78890\u0026key=user_uploads/315284/b3080566-a5d9-4464-b75e-006d864a64c0)\n\n## **Calls to Action**\n\n* Star, install, and test [**Uniflow**](https://github.com/CambioML/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering) on your laptop.\n* Report edge use cases via [Slack](https://join.slack.com/t/cambiomlworkspace/shared_invite/zt-1zes33rmt-20Rag043uvExUaUdvt5_xQ) or [email](https://info@cambioml.com). We would like to hear your (constructive) feedback! 👋","slug":"KWQ-cambioml-enterprise-data-gold-mining","created_at":"2024-02-29T16:07:20.382Z","updated_at":"2026-07-21T21:56:35.383Z","total_vote_count":50,"url":"https://www.ycombinator.com/launches/KWQ-cambioml-enterprise-data-gold-mining","share_image_url":"//bookface-static.ycombinator.com/assets/ycdc/yc-og-image-c440a0ad1dacfb86eeeb343717479cc54d256614449b4ef719977a0a451f8bc8.png","company":{"id":28981,"name":"CambioML","slug":"cambioml","url":"https://energent.ai/","logo":"https://bookface-images.s3.amazonaws.com/small_logos/4c2936168c1d24afabd8456dc03d4a2255c01c1f.png","batch":"Summer 2023","industry":"B2B","tags":["Productivity","Big Data","Automation","AI Assistant"],"search_path":"https://bookface.ycombinator.com/company/28981"}}